Journal  of  Machine  Learning  Research  10  (2009)  2777-2836 


Submitted  2/09;  Revised  8/09;  Published  12/09 


Cautious  Collective  Classification 


Luke  K.  McDowell  lmcdowel@usna.edu 

Department  of  Computer  Science 
U.S.  Naval  Academy 
Annapolis,  MD  21402,  USA 

Kalyan  Moy  Gupta  kalyan.gupta@knexusresearch.com 

Knexus  Research  Corporation 
Springfield,  VA  22153,  USA 

David  W.  Aha  david.aha@nrl.navy.mil 

Navy  Center  for  Applied  Research  in  Artificial  Intelligence 
Naval Research Laboratory (Code 5514)
Washington, DC 20375, USA


Editor:  Michael  Collins 


Abstract 

Many collective classification (CC) algorithms have been shown to increase accuracy when instances
are interrelated. However, CC algorithms must be carefully applied because their use of
estimated labels can in some cases decrease accuracy. In this article, we show that managing this
label uncertainty through cautious algorithmic behavior is essential to achieving maximal, robust
performance. First, we describe cautious inference and explain how four well-known families of
CC algorithms can be parameterized to use varying degrees of such caution. Second, we introduce
cautious learning and show how it can be used to improve the performance of almost any CC
algorithm, with or without cautious inference. We then evaluate cautious inference and learning for
the four collective inference families, with three local classifiers and a range of both synthetic and
real-world data. We find that cautious learning and cautious inference typically outperform less
cautious approaches. In addition, we identify the data characteristics that predict more substantial
performance differences. Our results reveal that the degree of caution used usually has a larger
impact on performance than the choice of the underlying inference algorithm. Together, these results
identify the most appropriate CC algorithms to use for particular task characteristics and explain
multiple conflicting findings from prior CC research.

Keywords: collective inference, statistical relational learning, approximate probabilistic inference,
networked data, cautious inference


1.  Introduction 

Traditional methods for supervised learning assume that the instances to be classified are independent
of each other. However, in many classification tasks, instances can be related. For example,
hyperlinked web pages are more likely to have the same class label than unlinked pages. Such
autocorrelation (correlation of class labels among interrelated instances) exists in a wide variety
of data (Neville and Jensen, 2007; Macskassy and Provost, 2007), including situations where the
relationships are implicit (e.g., email messages between two people are likely to share topics).


©2009  Luke  K.  McDowell,  Kalyan  Moy  Gupta,  and  David  W.  Aha. 




McDowell,  Gupta  and  Aha 


Collective classification (CC) is a method for jointly classifying related instances. To do so,
CC methods employ a collective inference algorithm that exploits dependencies between instances
(e.g., autocorrelation), enabling CC to often attain higher accuracies than traditional methods when
instances are interrelated (Neville and Jensen, 2000; Taskar et al., 2002; Jensen et al., 2004; Sen
et al., 2008). Several algorithms have been used for collective inference, including relaxation
labeling (Chakrabarti et al., 1998), the iterative classification algorithm (ICA) (Lu and Getoor, 2003a),
loopy belief propagation (LBP) (Taskar et al., 2002), Gibbs sampling (Gibbs) (Jensen et al., 2004),
and variants of the weighted-vote relational neighbor algorithm (wvRN) (Macskassy and Provost,
2007).

During testing, all collective inference algorithms exploit relational features based on uncertain
estimation of class labels. This test-time label uncertainty can diminish accuracy due to two related
effects. First, an incorrectly predicted label during testing may negatively influence the predictions
of its linked neighbors, possibly leading to cascading inference errors (cf., Neville and Jensen,
2008). Second, the training process may learn a poor model for test-time inference, because of the
disparity between the training scenario (where labels are known and certain) and the test scenario
(where labels are estimated and hence possibly incorrect). As a result, while CC has many potential
advantages, in some cases CC's label uncertainty may actually cause accuracy to decrease compared
to non-relational approaches (Neville and Jensen, 2007; Sen and Getoor, 2006; Sen et al., 2008).

In this article, we argue that managing this test-time label uncertainty through "cautious"
algorithmic behavior is essential to achieving maximal, robust performance. We describe two
complementary cautious strategies. Each addresses the fundamental problem of label uncertainty, but
separately targets the two manifestations of the problem described above. First, cautious inference
is an inference process that attends to the uncertainty of its intermediate label predictions.
For example, existing algorithms such as Gibbs or LBP accomplish cautious inference by sampling
from or directly reasoning with the estimated label distributions. These techniques are cautious
because they prevent less certain label estimates from having substantial influence on subsequent
estimations. Alternatively, we show how variants of a simpler algorithm, ICA, can perform cautious
inference by appropriately favoring more certain information. Second, cautious learning refers to a
training process that ameliorates the aforementioned train/test disparity. In particular, we introduce
PLUL (Parameter Learning for Uncertain Labels), which uses standard cross-validation techniques,
but in a way that is new for CC and that leads to significant performance advantages. PLUL is
cautious because it prevents the algorithm from learning a model from the (correctly
labeled) training set that overestimates how useful relational features will be when computed with
uncertain labels from the test set.

We consider four frequently-studied families of CC algorithms: ICA, Gibbs, LBP, and wvRN.
For each family, we describe algorithms that use varying degrees of cautious inference and explain
how they all (except for the relational-only wvRN) can also exploit cautious learning via PLUL.
We then evaluate the variants of these four families, with and without PLUL, over a wide range of
synthetic and real-world data sets. To broaden the evidence for our results, we evaluate three local
classifiers that are used by some of the CC algorithms, and also compare against a non-relational
baseline.

While recent CC studies describe complementary results and make some related comparisons,
they omit important variations that we consider here (see Section 3). Moreover, the scope and/or
methodology of previous studies leaves several important questions unanswered. For instance,
Gibbs is often regarded as one of the most accurate inference algorithms, and has been shown
to work well for CC (Jensen et al., 2004; Neville and Jensen, 2007). If so, why did Sen et al.
(2008) find no significant difference between Gibbs and the much less sophisticated ICA? Second,
we earlier reported that ICA_C (a cautious variant of ICA) outperforms both Gibbs and ICA on three
real-world data sets (McDowell et al., 2007a). Why would ICA_C outperform Gibbs, and for what
data characteristics are ICA_C's gains significant? We answer these questions and more in Section 8.

We hypothesize that cautious CC algorithms will outperform more aggressive CC approaches
when there exists a high probability of an "incorrect relational inference", which we define as a
prediction error that is due to reasoning with relational features (i.e., an error that does not occur when
relational features are removed). Two kinds of data characteristics increase the likelihood of such
errors. First, when the data characteristics lead to lower overall classification accuracy (e.g., when
the non-relational attributes are not highly predictive), then the computed relational feature values
will be less reliable. Second, when a typical relational link is highly predictive (e.g., as occurs when
the data exhibits high relational autocorrelation), then the potential effect of any incorrect
prediction is magnified. As the magnitude of either of these data set characteristics increases, cautious
algorithms should outperform more aggressive algorithms by an increasing amount.
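
The definition of an incorrect relational inference can be made concrete with a small sketch. It assumes that each node has been classified twice, once with relational features and once without; the function name and the toy label lists are hypothetical and are not taken from the experiments in this article.

```python
def incorrect_relational_inferences(truth, with_rel, without_rel):
    """Indices of nodes that are misclassified when relational features are
    used but classified correctly when those features are removed."""
    return [
        i
        for i, (y, y_rel, y_attr) in enumerate(zip(truth, with_rel, without_rel))
        if y_rel != y and y_attr == y
    ]


# Toy data (hypothetical): true labels and two sets of predictions.
truth = ["P", "S", "S", "P"]
with_rel = ["P", "P", "S", "S"]      # predictions using relational features
without_rel = ["P", "S", "P", "S"]   # predictions using attributes only

errors = incorrect_relational_inferences(truth, with_rel, without_rel)
print(errors)  # [1]
```

Node 3 is wrong under both classifiers, so under the definition above it does not count as an error caused by relational reasoning; only node 1 does.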

Our contributions are as follows. First, we describe cautious inference and how four commonly-used
families of existing CC inference algorithms can exhibit more or less caution. Second, we
introduce cautious learning and explain how it can help compensate for the train/test disparity that
occurs when a CC algorithm uses estimated class labels during testing. Third, we identify the data
characteristics for which these cautious techniques should outperform more aggressive approaches,
as introduced in the preceding paragraph and discussed in more detail in Section 6. Our
experimental results confirm that cautious approaches typically do outperform less cautious variants, and
that these effects grow larger when there is a greater probability of incorrect relational inference.
Moreover, our results reveal that in most cases the degree of caution used has a larger impact on
performance than the choice of the underlying inference algorithm. In particular, the cautious
algorithms perform very similarly, regardless of whether ICA_C or Gibbs or LBP is used, although our
results also confirm that, for some data characteristics, inference with LBP performs comparatively
poorly. These results suggest that in many cases the higher computational complexity of Gibbs and
LBP is unnecessary, and that the much faster ICA_C should be used instead. Finally, our results and
analysis enable us to answer the previously mentioned questions regarding CC.

The next two sections summarize collective classification and related work. Section 4 then
explains why CC needs to be cautious and describes cautious inference and learning in more detail.
In Section 5, we describe how caution can be specifically used by the four families of CC inference
algorithms. Section 6 then describes our methodology and hypotheses. Section 7 presents our
results, which we discuss in Section 8. We conclude in Section 9.

2.  Collective  Classification:  Description  and  Problem  Definition 

In  this  section,  we  first  motivate  and  define  collective  classification  (CC).  We  then  describe  different 
approaches  to  CC,  different  CC  tasks,  and  our  assumptions  for  this  article. 

2.1  Problem  Statement  and  Example 

In many domains, relations exist among instances (e.g., among hyperlinked web pages, social
network members, co-cited publications). These relations may be helpful for classification tasks, such
as predicting the topic of a publication or the group membership of a person (Koller et al., 2007).
More formally, we consider the following task (based on Macskassy and Provost, 2006):


Definition 1 (Classification of Graph-based Data) Assume we are given a graph $G = (V, E, X, Y, C)$
where $V$ is a set of nodes, $E$ is a set of (possibly directed) edges, each $x_i \in X$ is an attribute vector for
node $v_i \in V$, each $Y_i \in Y$ is a label variable for $v_i$, and $C$ is the set of possible labels. Assume further
that we are given a set of "known" values $Y^K$ for nodes $V^K \subset V$, so that $Y^K = \{y_i \mid v_i \in V^K\}$. Then the
task is to infer $Y^U$, the values of $Y_i$ for the remaining nodes with "unknown" values ($V^U = V - V^K$),
or a probability distribution over those values.¹
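
One way to hold the objects named in Definition 1 in a program is sketched below. This is an illustrative container only: the class name, field names, and the example edges and attribute values are assumptions, chosen so that the example has four nodes of which one has a known label.

```python
from dataclasses import dataclass


@dataclass
class GraphTask:
    """Container for G = (V, E, X, Y, C); all field names are illustrative."""
    nodes: list          # V
    edges: list          # E, stored as (source, target) pairs
    attributes: dict     # X: node -> attribute vector x_i
    known_labels: dict   # Y^K: node -> label, only for nodes in V^K
    classes: tuple       # C

    def known_nodes(self):
        """V^K: nodes whose label is given."""
        return {v for v in self.nodes if v in self.known_labels}

    def unknown_nodes(self):
        """V^U = V - V^K: nodes whose label must be inferred."""
        return set(self.nodes) - self.known_nodes()

    def neighbors(self, v):
        """Nodes linked to v, ignoring edge direction."""
        return {b if a == v else a for a, b in self.edges if v in (a, b)}


# Hypothetical four-node example with one known label.
task = GraphTask(
    nodes=[1, 2, 3, 4],
    edges=[(1, 2), (2, 3), (2, 4)],
    attributes={1: (1, 1, 1), 2: (1, 1, 0), 3: (0, 0, 1), 4: (0, 0, 0)},
    known_labels={3: "S"},
    classes=("P", "S"),
)
print(sorted(task.unknown_nodes()))  # [1, 2, 4]
```

Storing only the known labels, rather than a label for every node, keeps the split between $V^K$ and $V^U$ explicit and allows $V^K$ to be empty.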


For example, consider the task of predicting whether a web page belongs to a professor or a
student. Conventional supervised learning approaches ignore the link relations and classify each page
using attributes derived from its content (e.g., words present in the page). We refer to this approach
as non-relational classification. In contrast, a technique for relational classification would explicitly
use the links to construct additional relational features for classification (e.g., for each page,
including as features the words from hyperlinked pages). This additional information can potentially
increase classification accuracy, though it may sometimes decrease accuracy as well (Chakrabarti et al.,
1998). Alternatively, even greater (and usually more reliable) increases can occur when the class
labels of the linked pages are used instead to derive relevant relational features (Jensen et al., 2004).
However, using features based on these labels is challenging, because some or all of the labels are
initially unknown, and thus typically must be estimated and then iteratively refined in some way.
This process of jointly inferring the labels of interrelated nodes is known as collective classification
(CC).

Figure 1 summarizes an example execution of a simple CC algorithm, ICA, applied to the binary
web page classification task. Each step in the sequence displays a graph of four nodes, each
denoting a web page, and the hyperlinks among them. Each node has a class label $y_i$; the set of
possible class labels is $C = \{P, S\}$, denoting professors and students, respectively. Three nodes have
unknown labels ($V^U = \{v_1, v_2, v_4\}$) and one node has a known label ($V^K = \{v_3\}$). In the initial state
(step A), no label $y_i$ has yet been estimated for the nodes in $V^U$, so each is set to missing (indicated
by a question mark). Each node has three binary attributes (represented by $x_i$). Nodes in $V^U$ also
have two relational features (one per class), represented by the vector $f_i$. Each feature denotes the
number of linked nodes (ignoring link direction) that have a particular class label.

In step B, some classifier (not shown) estimates class labels for nodes in $V^U$ using only the
(non-relational) attributes. These labels, along with the known label $y_3$, are used in step C to compute
the relational feature value vectors. For instance, in step C, $f_2 = (1, 2)$ because $v_2$ links to nodes
with one current P label and two current S labels. In step D, a classifier re-estimates the labels using
both attributes and relational features, which changes the predicted label of $v_2$. In step E, relational
feature values are re-computed using the new labels. Steps D and E then repeat until a termination
criterion is satisfied (e.g., convergence, number of iterations).

This example exhibits how relational value uncertainty occurs with CC. For instance, the feature
vector $f_1$ is $(1, 0)$ in step C but later becomes $(0, 1)$. Thus, intermediate predictions use uncertain
label estimates, motivating the need to cautiously use such estimates.
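
The steps described above can be sketched in a few lines of code. The sketch is not the implementation evaluated in this article: the edge list, attribute values, known label, scoring rule, and weight `w` are all hypothetical, chosen only so that the run reproduces the feature values quoted in the text ($f_2 = (1, 2)$ in step C, and $f_1$ changing from $(1, 0)$ to $(0, 1)$).

```python
from collections import Counter

CLASSES = ("P", "S")
EDGES = [(1, 2), (2, 3), (2, 4)]          # hypothetical links
ATTRS = {1: (1, 1, 1), 2: (1, 1, 0),      # hypothetical binary attributes x_i
         3: (0, 0, 1), 4: (0, 0, 0)}
KNOWN = {3: "S"}                          # hypothetical known label y_3


def neighbors(v):
    """Linked nodes, ignoring link direction."""
    return [b if a == v else a for a, b in EDGES if v in (a, b)]


def rel_features(v, labels):
    """Count of linked nodes per class label, in the order of CLASSES."""
    counts = Counter(labels[u] for u in neighbors(v))
    return tuple(counts[c] for c in CLASSES)


def classify(x, f=(0, 0), w=2):
    """Toy stand-in for a learned local classifier (an assumption)."""
    score_p = sum(x) + w * f[0]
    score_s = (len(x) - sum(x)) + w * f[1]
    return "P" if score_p > score_s else "S"


def ica(max_iters=10):
    labels = dict(KNOWN)
    unknown = [v for v in ATTRS if v not in KNOWN]
    for v in unknown:                      # step B: attributes only
        labels[v] = classify(ATTRS[v])
    history = []
    for _ in range(max_iters):
        # Steps C/E: compute relational features from the current labels.
        feats = {v: rel_features(v, labels) for v in unknown}
        history.append(feats)
        # Step D: re-estimate every unknown label from attributes + features.
        new_labels = dict(labels)
        for v in unknown:
            new_labels[v] = classify(ATTRS[v], feats[v])
        if new_labels == labels:           # terminate on convergence
            break
        labels = new_labels
    return labels, history


labels, history = ica()
print(labels)         # final label for every node
print(history[0][2])  # f_2 in step C
```

Note that each unknown node's label is replaced outright in every iteration, whatever the margin between the two scores; this is the aggressive behavior that the cautious variants discussed in this article are designed to temper.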

1. $V^K$ may be empty. In addition, a separate training graph may be provided; see Section 2.3.




[Figure 1 image omitted. Recoverable panel titles: A.) Initial State; B.) Classify (use attributes only);
C.) Compute rel. feature values; (repeat steps D and E...)]

Figure 1: Example operation of ICA, a simple CC algorithm. Each step (A through E) shows a graph of
4 linked nodes (i.e., web pages). "Known" values are shown in white text on a black
background; this includes all attribute values $x_i$ and the class label $y_3$ for $v_3$. Estimated
values are shown instead with a white background.


2.2  Algorithms  for  Collective  Inference 

For some collective inference tasks, exact methods such as junction trees (Huang and Darwiche,
1996) or variable elimination (Zhang and Poole, 1996) can be applied. However, these methods
may be prohibitively expensive to use (e.g., summing over the remaining variable configurations is
intractable for modest-sized graphs). Some research has focused on methods that further factorize
the variables, and then apply an exact procedure such as belief propagation (Neville and Jensen,
2005), min-cut partition (Barzilay and Lapata, 2005), or methods for solving quadratic and linear
programs (Triebel et al., 2007). In this article, we consider only approximate collective inference
methods.

We  consider  three  primary  types  of  approximate  collective  inference  algorithms,  borrowing 
some  terminology  from  Sen  et  al.  (2008): 


•  Local classifier-based methods. For these methods, inference is an iterative process whereby
a local classifier predicts labels for each node in $V^U$ using both attributes and relational
features (derived from the current label predictions), and then a collective inference algorithm
recomputes the class labels, which will be used in the next iteration. Examples of this type
of CC algorithm include ICA (used in the example above) and Gibbs. Local classifiers that
have been used include Naive Bayes (Jensen et al., 2004), relational probability trees (Neville
et al., 2003a), k-nearest neighbor (McDowell et al., 2007b), and logistic regression (Sen et al.,
2008). Typically, a supervised learner induces the local classifier from the training set using
both attributes and relational features.




•  Global formulation-based methods. These methods train a classifier that seeks to
optimize one global objective function, often based on a Markov random field (Dobrushin, 1968;
Besag, 1974). As above, the classifier uses both attributes and relational features for
inference. Examples of these algorithms include loopy belief propagation and relaxation labeling.
These do not use a separate local classifier; instead, the entire algorithm is used for both
training (e.g., to learn the clique potentials) and inference. See Taskar et al. (2002) and Sen et al.
(2008) for more details.

•  Relational-only methods. Recently, Macskassy and Provost (2007) demonstrated that, when
some labels are known (i.e., $|V^K| > 0$), algorithms that use only relational information can in
some cases perform very well. We consider several variants of the algorithm they described,
wvRN_RL (weighted-vote relational neighbor, with relaxation labeling). This algorithm
computes a new label distribution for a node by averaging the current distributions of its neighbors.
It does not require any training.
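
The neighbor-averaging update used by the relational-only family can be illustrated with a minimal sketch. It keeps only the averaging step described in the last item above and makes several simplifying assumptions that are not taken from Macskassy and Provost (2007): links are unweighted, unknown nodes start from a uniform distribution, updates are synchronous, and no damping schedule is applied. The function name and the example graph are hypothetical.

```python
def neighbor_average(nodes, edges, known, classes, iters=50):
    """Repeatedly replace each unknown node's label distribution with the
    mean of its neighbors' current distributions (known nodes stay fixed)."""
    nbrs = {v: [] for v in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)

    uniform = {c: 1.0 / len(classes) for c in classes}
    dist = {}
    for v in nodes:
        if v in known:
            dist[v] = {c: 1.0 if known[v] == c else 0.0 for c in classes}
        else:
            dist[v] = dict(uniform)

    for _ in range(iters):
        new_dist = {}
        for v in nodes:
            if v in known or not nbrs[v]:
                new_dist[v] = dist[v]          # fixed, or nothing to average
            else:
                new_dist[v] = {
                    c: sum(dist[u][c] for u in nbrs[v]) / len(nbrs[v])
                    for c in classes
                }
        dist = new_dist
    return dist


# Hypothetical chain 1 - 2 - 3 - 4 with known labels at both ends.
dist = neighbor_average(
    nodes=[1, 2, 3, 4],
    edges=[(1, 2), (2, 3), (3, 4)],
    known={1: "P", 4: "S"},
    classes=("P", "S"),
)
print(round(dist[2]["P"], 3), round(dist[3]["P"], 3))
```

In this chain the estimates settle so that the node nearer the known P label leans toward P and the node nearer the known S label leans toward S, using no attributes and no training.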

With local classifier methods, learning the classifier can often be done in a single pass over the
data, does not require running collective inference, and in fact is independent of the collective
inference procedure that will be used. In contrast, for global methods the local classifier and inference
algorithm are effectively unified. As a result, learning for a global method requires committing to
and actually executing a specific inference algorithm, and thus can be much slower than with a local
classifier-based method.

All of these algorithms jointly classify interrelated nodes using some iterative process. Those
that propagate from one iteration to the next a single label for each node are called hard-labeling
methods. Methods that instead propagate a probability distribution over the possible class labels
are called soft-labeling methods (cf., Galstyan and Cohen, 2007). All of the local classifier-based
methods that we examine are hard-labeling methods.² Soft-labeling methods, such as variants of
relaxation labeling, are also possible but require that the local classifier be able to reason directly
with label distributions, which is more complex than the label aggregation for features typically
done with approaches like ICA or Gibbs. Section 6.6 provides more detail on these features.
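
The difference between the two propagation styles can be seen in a short sketch. It assumes a count-style relational feature for the hard-labeling case and a simple sum of probabilities for the soft-labeling case; both aggregation functions and the example distributions are illustrative assumptions, not the feature definitions of Section 6.6.

```python
CLASSES = ("P", "S")


def hard_feature(neighbor_dists):
    """Hard labeling: each neighbor passes on only its most likely label,
    and the feature counts those labels."""
    counts = {c: 0 for c in CLASSES}
    for dist in neighbor_dists:
        best = max(CLASSES, key=lambda c: dist[c])
        counts[best] += 1
    return counts


def soft_feature(neighbor_dists):
    """Soft labeling: each neighbor passes on its full distribution,
    and the feature sums the probability mass per class."""
    return {c: sum(dist[c] for dist in neighbor_dists) for c in CLASSES}


# Hypothetical neighbors: two weak votes for P and one strong vote for S.
neighbor_dists = [
    {"P": 0.55, "S": 0.45},
    {"P": 0.60, "S": 0.40},
    {"P": 0.10, "S": 0.90},
]
hard = hard_feature(neighbor_dists)
soft = soft_feature(neighbor_dists)
print(hard)  # {'P': 2, 'S': 1}
print(soft)  # roughly P: 1.25, S: 1.75
```

With these values the hard feature favors P while the soft feature favors S, because the two P estimates are barely more likely than not; a downstream classifier therefore sees different evidence depending on which style is used.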

2.3  Task  Definitions  and  Focus 

Collective classification has been applied to two types of inference tasks, namely the out-of-sample
task, where $V^K$ is empty, and the in-sample task, where $V^K$ is not empty. Both types of tasks
may emerge in real-world situations (Neville and Jensen, 2005). Prior work on out-of-sample tasks
(Neville and Jensen, 2000; Taskar et al., 2002; Sen and Getoor, 2006) assumes that the algorithm is
also provided with a training graph $G_{tr}$ that is disjoint from the test graph $G$. For instance, a model
may be learned over the web-graph for one institution, and tested on the web-graph of another.

For in-sample tasks, where some labels in $G$ are known, CC can be applied to the single graph
$G$ (Macskassy and Provost, 2007; McDowell et al., 2007a; Sen et al., 2008; Gallagher et al., 2008);
within-network classification (Macskassy and Provost, 2006) involves training on the subset $G^K \subset G$
with known labels, and testing by running inference over the entire graph. This task simulates, for
example, fraud detection in a single large telecommunication network where some entities/nodes are
known to be fraudulent. Another in-sample task (Neville and Jensen, 2007; Bilgic and Getoor, 2008;
Neville and Jensen, 2008) assumes a separate training graph $G_{tr}$, where a model is learned from
$G_{tr}$ and inference is performed over the test graph $G$, which includes both labeled and unlabeled
nodes. For both tasks, predictive accuracy is measured only for the unlabeled nodes.

2. We could also consider wvRN_RL, which is a soft-labeling method, to be a local classifier-based method, albeit a
simple one that ignores attributes and does no learning. However, for explication we list relational-only methods as a
separate category in the list above because our results will show they often have rather different performance trends.

In Section 6, we will address three types of tasks (i.e., out-of-sample, sparse in-sample, and
dense in-sample). This is similar to the set of tasks addressed in some previous evaluations (e.g.,
Neville and Jensen, 2007, 2008; Bilgic and Getoor, 2008) and subsumes some others (e.g., Neville
and Jensen, 2000; Taskar et al., 2002; Sen and Getoor, 2006). We will not directly address the
within-network task, but the algorithmic trends observed from our in-sample evaluations should be
similar.³

2.4  Assumptions  and  Limitations 

In this broad investigation on the utility of caution in collective classification, we make several
simplifying assumptions. First, we assume data is obtained passively rather than actively (Rattigan
et al., 2007; Bilgic and Getoor, 2008). Second, we assume that nodes are homogeneous (e.g., all
represent the same kind of object) rather than heterogeneous (Neville et al., 2003a; Neville and
Jensen, 2007). Third, we assume that links are not missing, and need not be inferred (Bilgic and
Getoor, 2008). Finally, we do not attempt to increase autocorrelation via techniques such as link
addition (Gallagher et al., 2008), clustering (Neville and Jensen, 2005), or problem transformation
(Tian et al., 2006; Triebel et al., 2007).

Our example in Figure 1 employs a simple relational feature (i.e., that counts the number of
linked nodes with a specific class label). However, several other types of relations exist. For
example, Gallagher and Eliassi-Rad (2008) describe a topology of feature types, including structural
features that are independent of node labels (e.g., the number of linked neighbors of a given node).
We focus on only three simple types of relational features (see Section 6.6), and leave broader
investigations for future work. Likewise, for CC algorithms that learn, we assume that training is
performed just once, which differs from some prior work where the learned model is updated in
each iteration (Lu and Getoor, 2003b; Gurel and Kersting, 2005).

3.  Related  Work 

Besag (1986) originally described the "Iterated Conditional Modes" (ICM) algorithm, which is a
version of the ICA algorithm that we consider. Several researchers have reported that employing
inter-instance relations in CC algorithms can significantly increase predictive accuracy (e.g.,
Chakrabarti et al., 1998; Neville and Jensen, 2000; Taskar et al., 2002; Lu and Getoor, 2003a).
Furthermore, these algorithms have performed well on a variety of tasks, such as identifying
securities fraud (Neville et al., 2005), ranking suspicious entities (Macskassy and Provost, 2005), and
annotating semantic web services (Heß and Kushmerick, 2004).

In each iteration, a CC algorithm predicts a class label (or a class distribution) for each node and
uses it to determine the next iteration's predictions. Although using label predictions from linked
nodes (instead of using the larger number of attributes from linked nodes) encapsulates the influence
of a linked node and simplifies learning (Jensen et al., 2004), it can be problematic. For example,
iterating with incorrectly predicted labels can propagate and amplify errors (Neville and Jensen,
2007; Sen and Getoor, 2006; Sen et al., 2008), diminishing or even reducing accuracy compared
to non-relational approaches. In this article, we examine the data characteristics (and algorithmic
interactions) for which these issues are most serious and explain how cautious approaches can
ameliorate them.

3. Indeed, we performed additional experiments where we reproduced the synthetic data of Sen et al. (2008), but then
transformed the task from their within-network variant to a variant that uses a separate graph for training (as done in
this article), and obtained results similar to those they reported.

The performance of CC compared to non-relational learners depends greatly on the data
characteristics. First, for CC to improve performance, the data must exhibit relational autocorrelation
(Jensen et al., 2004; Neville and Jensen, 2005; Macskassy and Provost, 2007; Rattigan et al., 2007;
Sen et al., 2008), which is correlation among the labels of related instances (Jensen and Neville,
2002). Complex correlations can be exploited by some CC algorithms, capturing for instance the
notion "Professors primarily have out-links to Students." In contrast, the simplest kind of
correlation is homophily (McPherson et al., 2001), in which links tend to connect nodes with the same
label. To facilitate replication, Appendix A defines homophily more formally.
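
As a rough illustration of what "links tend to connect nodes with the same label" means in practice, the sketch below measures the fraction of links whose two endpoints share a label. This is only a simple proxy under our own assumptions; it is not the formal definition of homophily given in Appendix A, and the function name and example graph are hypothetical.

```python
def same_label_edge_fraction(edges, labels):
    """Fraction of links whose endpoints carry the same label.
    Links with an unlabeled endpoint are skipped; returns None if no
    link has both endpoints labeled."""
    usable = [(a, b) for a, b in edges if a in labels and b in labels]
    if not usable:
        return None
    same = sum(labels[a] == labels[b] for a, b in usable)
    return same / len(usable)


# Hypothetical graph: three of the four links join nodes with equal labels.
edges = [(1, 2), (2, 3), (2, 4), (3, 4)]
labels = {1: "P", 2: "S", 3: "S", 4: "S"}
fraction = same_label_edge_fraction(edges, labels)
print(fraction)  # 0.75
```

A value near 1 indicates that a neighbor's label is strong evidence for a node's own label, which is also the setting in which a single wrong neighbor estimate can do the most damage.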

A second data characteristic that can influence CC performance is attribute predictiveness. For
example, if the attributes are far less predictive than the selected relational features, then CC
algorithms should perform comparatively well vs. traditional algorithms (Jensen et al., 2004). Third,
link density plays a role (Jensen and Neville, 2002; Neville and Jensen, 2005; Sen et al., 2008); if
there are few relations among the instances, then collective classification may offer little benefit.
Alternatively, algorithms such as LBP are known to perform poorly when link density is very high
(Sen and Getoor, 2006). Fourth, an important factor is the labeled proportion (the proportion of test
nodes that have known labels). In particular, if some node labels are known ($|V^K| > 0$), these labels
may help prevent CC estimation errors from cascading. In addition, if a substantial number of
labels are known, simpler relational-only algorithms may be the most effective. Although additional
data characteristics exist that can influence the performance of CC algorithms, such as degree of
disparity (Jensen et al., 2003) and assortativity (Newman, 2003; Macskassy, 2007), we concentrate
on these four in our later evaluations.

Compared to this article, prior studies provide complementary results and make some relevant
comparisons, but do not examine important variations that we consider here. For instance, Jensen
et al. (2004) only investigate a single collective inference algorithm, and Macskassy and Provost
(2007) focus on relational-only (univariate) algorithms. Sen et al. (2008) assess several algorithms
on real and synthetic data, but do not examine the impact of attribute predictiveness or labeled
proportion. Likewise, Neville and Jensen (2007) evaluate synthetic and real data, but vary data
characteristics (autocorrelation and labeled proportion) for only the synthetic data, do not consider ICA,
and consider LBP only for the synthetic data. In addition, only one of these prior studies (Neville
and Jensen, 2000) evaluates an algorithm related to ICAc, which is a simple cautious variant of ICA
that we show has promising performance. Moreover, these studies did not compare algorithms that
vary only in their degree of cautious inference, or use cautious learning.

4.  Types  of  Caution  in  CC  and  Why  Caution  is  Important 

Section 3 described how collective classification exploits label predictions to try to increase
accuracy, but how iterating with incorrectly predicted labels can sometimes propagate and amplify
errors. To address this problem, we recently proposed the use of cautious inference for CC
(McDowell et al., 2007a). We defined an inference algorithm to be cautious if it sought to “explicitly
identify and preferentially exploit the more certain relational information.” In addition, we
explained that a variant of ICA that we here call ICAc is cautious because it selectively ignores class
labels that were predicted with less confidence by the local classifier. Previously, Neville and Jensen
(2000) introduced a simpler version^4 of ICAc but compared it only with non-relational classifiers.
We showed that ICAc can outperform ICA and Gibbs, but did not identify the data conditions under
which such gains hold.

In this article, we expand our original notion of caution in two ways. First, we broaden our
idea of cautious inference to encompass several other existing CC inference techniques that seek
the same goal (managing prediction uncertainty). Recognizing the behavioral similarities between
these different algorithms helps us to better assess the strengths and weaknesses of each algorithm
for a particular data set. Second, we introduce cautious learning, a technique that ameliorates
prediction uncertainty even before inference is applied, which can substantially increase accuracy.
Below we detail these two types of caution.

• A CC algorithm exhibits cautious inference if its inference process attends to the uncertainty
of its intermediate label predictions. Usually, this uncertainty is approximated via the
posterior probabilities associated with each predicted label. For instance, a CC algorithm may
exercise cautious inference by favoring predicted information that has less uncertainty (higher
confidence). This is the approach taken by ICAc, which uses only the most certain labels at
the beginning of its operation, then gradually incorporates less certain predictions in later
iterations. Alternatively, instead of always selecting the most likely class label for each node
(like ICA and ICAc), Gibbs re-samples the label of each node based on its estimated
distribution. This re-sampling leads to more stochastic variability (and less influence) for nodes with
less certain predictions. Finally, soft-labeling algorithms like LBP, relaxation labeling, and
wvRNRL directly reason with the estimated label distributions. For instance, wvRNRL averages
the estimated distributions of a node’s linked neighbors, which gives more influence to more
certain predictions.

• A CC algorithm exhibits cautious learning if its training process is influenced by
recognizing the disparity between the training set (where labels are known and certain) and the test
scenario (where labels may be estimated and hence incorrect). In particular, a relational
feature may appear to be highly predictive of the class when examining the training set (e.g., to
learn conditional probabilities or feature weights), yet its use may actually decrease accuracy
if its value is often incorrect during testing. In response, one approach is to ensure that
appropriate training parameters are cross-validated using the actual testing conditions (e.g., with
estimated test labels). We use PLUL to achieve this goal.
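
The difference between reasoning with hard labels and reasoning with estimated distributions can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the helper names `average_distributions` and `vote_hard_labels` and the toy neighborhood are hypothetical, and the averaging is unweighted.

```python
def average_distributions(neighbor_dists):
    """Soft-labeling: average the neighbors' estimated label distributions."""
    classes = list(neighbor_dists[0].keys())
    k = len(neighbor_dists)
    return {c: sum(d[c] for d in neighbor_dists) / k for c in classes}


def vote_hard_labels(neighbor_dists):
    """Hard-labeling: each neighbor votes for its single most likely label."""
    classes = list(neighbor_dists[0].keys())
    votes = {c: 0 for c in classes}
    for d in neighbor_dists:
        votes[max(classes, key=lambda c: d[c])] += 1
    k = len(neighbor_dists)
    return {c: votes[c] / k for c in classes}


# Hypothetical neighborhood: one confident "A" neighbor, two barely-"B" neighbors.
neighbors = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.45, "B": 0.55},
    {"A": 0.45, "B": 0.55},
]
soft = average_distributions(neighbors)  # favors "A": the confident neighbor dominates
hard = vote_hard_labels(neighbors)       # favors "B": two uncertain votes outnumber one
print(soft, hard)
```

In this toy case the soft aggregate leans toward the label of the single confident neighbor, whereas the hard vote follows the two uncertain neighbors, which is the sense in which averaging distributions gives more influence to more certain predictions.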

The  next  section  describes  how  these  general  ideas  can  be  applied.  Later,  our  experimental 
results  demonstrate  when  they  lead  to  significant  performance  improvements. 

5.  Applying  Cautious  Inference  and  Learning  to  Collective  Classification 

The  previous  section  described  two  types  of  caution  for  CC.  Each  attempts  to  alleviate  potential 
estimation  errors  in  labels  during  collective  inference.  Cautious  inference  and  cautious  learning 
can  often  be  combined,  and  at  least  one  is  used  or  is  applicable  to  every  CC  algorithm  known  to 

4. Their algorithm is like ICAc, except that it does not consider how to favor “known” labels from V^K.




us. In this section, we provide examples of how both types can be applied by describing specific
CC algorithms that exploit cautious inference (Sections 5.1-5.4), and by describing how PLUL can
complement these algorithms with cautious learning (Section 5.5). Section 5.6 then discusses the
computational complexity of these algorithms.

We describe and evaluate four families of CC inference algorithms: ICA, Gibbs, LBP, and
wvRN.^5 Among local classifier-based algorithms, we chose ICA and Gibbs because both have been
frequently studied and often perform well. As a representative global formulation-based algorithm,
we chose LBP instead of relaxation labeling because previous studies (Sen and Getoor, 2007; Sen
et al., 2008) found similar performance, with in some cases a slight edge for LBP. Finally, we select
wvRN because it is a good relational-only baseline for CC evaluations (Macskassy and Provost,
2007).

Table 1 summarizes the four CC families that we consider. Within each family, each variant uses
more cautious inference than the variant listed below it. Cautious variants of standard algorithms
are given a “C” subscript (e.g., ICAc), while non-cautious variants of standard algorithms are given
an “NC” subscript (e.g., GibbsNC). For the latter case, our intent is not to demonstrate large
performance “gains” for a standard algorithm vs. a new non-cautious variant, but to isolate the impact of
a particular cautious algorithmic behavior on performance. While the result may not be a
theoretically coherent algorithm (e.g., GibbsNC, unlike Gibbs, is not an MCMC algorithm), in every case
the resultant algorithm does perform well under data set situations where caution is not critical (see
Section 7). Thus, comparing the performance of the cautious vs. non-cautious variants allows us to
investigate the data characteristics for which cautious behavior is more important.^6

5.1  ICA  Family  of  Algorithms 

Figure 2 displays pseudocode for ICA, ICAc, and ICAKn, depending on the setting of the parameter
AlgType. We describe each in turn.

5.1.1  ICA 

In Figure 2, step 1 is a “bootstrap” step that predicts the class label y_i of each node in V^U using
only attributes (conf_i records the confidence of this prediction, but ICA ignores this information).
The algorithm then iterates (step 2). During each iteration, ICA selects all available labels (step 3),
computes the relational features’ values based on these labels (step 4), and then re-predicts the class
label of each node using both attributes and relational features (step 5). After iterating, hopefully to
convergence, step 6 returns the final set of estimated class labels.

Types  of  Caution  Used:  Steps  3-4  of  ICA  use  all  available  labels  for  feature  computation  (including 
estimated,  possibly  incorrect  labels)  and  step  5  picks  the  single  most  likely  label  for  each  node  based 
on  the  new  predictions.  In  these  steps,  uncertainty  in  the  predictions  is  ignored.  Thus,  ICA  does  not 


5. Technically, wvRN by itself is a local classifier, not an inference algorithm, but for brevity we refer to the family of
algorithms based on this classifier (such as wvRNRL) as wvRN.

6. Section 7 shows that the non-cautious variants ICA, GibbsNC, and LBPNC perform similarly to each other. Thus, our
empirical results would change little if we compared all of the cautious algorithms against the more standard ICA.
However, the results for Gibbs and LBP would then concern performance differences between distinct algorithms,
due to conjectured but unconfirmed differences in algorithmic properties. By instead comparing Gibbs vs. GibbsNC
and LBP vs. LBPNC, we more precisely demonstrate that the cautious algorithms benefit from specifically identified
cautious behaviors.




Name | Cautious Inf.? | Key Features | Type | Evaluated by?
--- | --- | --- | --- | ---
Local classifier-based methods that iteratively classify nodes, yielding a final graph state | | | |
ICAc | Favors more conf. labels | Relational features depend only on “more confident” estimated labels; later iterations loosen confidence threshold. | Hard | Neville and Jensen (2000); McDowell et al. (2007a)
ICAKn | Favors known labels | First iteration: rel. features depend only on known labels. Later iterations: use all labels. | Hard | McDowell et al. (2007a)
ICA | Not cautious | Always use all labels, known and estimated. | Hard | Lu and Getoor (2003a); Sen and Getoor (2006); McDowell et al. (2007a,b)
Local classifier-based algorithms that compute conditional probabilities for each node | | | |
Gibbs | Samples from estimated distribution | At each step, classifies using all neighbor labels, then samples new labels from the resultant distributions. Records new labels to produce final marginal statistics. | Hard | Jensen et al. (2004); Neville and Jensen (2007); Sen et al. (2008)
GibbsNC | Not cautious | Like Gibbs, but always pick most likely label instead of sampling. | Hard | None, but very similar to ICA.
Global formulation algorithms based on loopy belief propagation (LBP) | | | |
LBP | Reasons with estimated distribution | Passes continuous-valued messages between linked neighbors until convergence. | Soft | Taskar et al. (2002); Neville and Jensen (2007); Sen et al. (2008)
LBPNC | Not cautious | Like LBP, but each node always chooses single most likely label to use for next round of messages. | Hard | -
Relational-only algorithms | | | |
wvRNRL | Reasons with estimated distribution | Computes new distribution by averaging neighbors’ label distributions; combines old and new distributions via relaxation labeling. | Soft | Macskassy and Provost (2007); Gallagher et al. (2008)
wvRNICA+C | Favors nodes closer to known labels | Initializes nodes in V^U to missing. Computes most likely label by averaging neighbors’ labels, ignoring missing labels. | Hard | Macskassy and Provost (2007); similar to Galstyan and Cohen (2007)
wvRNICA+NC | Not cautious | Like wvRNICA+C, but no missing labels are used. Instead, initialize nodes in V^U by sampling from the prior label distribution. | Hard |

Table 1: The ten collective inference algorithms considered in this article, divided into four
families. Hard/soft refers to hard-labeling and soft-labeling (see Section 2.2).


perform cautious inference. However, it may exploit cautious learning to learn the classifier models
that are used for inference (M_A and M_AR).

5.1.2  ICAc 

In steps 3-4 of Figure 2, ICA assumes that the estimated node labels are all equally likely to be
correct. When AlgType instead selects ICAc, the inference becomes more cautious by only
considering more certain estimated labels. Specifically, step 3 “commits” into Y' only the best m of




ICA_classify(V, E, X, Y^K, M_AR, M_A, n, AlgType) =
// V=nodes, E=edges, X=attribute vectors, Y^K=labels of known nodes (Y^K = {y_i | v_i ∈ V^K})
// M_AR=local classifier (uses attributes and relations), M_A=classifier that uses only attributes
// n=# of iterations, AlgType=ICAc, ICAKn, or ICA

1  for each node v_i ∈ V^U do                  // Bootstrap: estimate label y_i for each node
     (y_i, conf_i) ← M_A(x_i)                  // using attributes only
2  for h = 0 to n do
3    // Select node labels to use for computing relational feature values, store in Y'
     if (AlgType = ICAc)                       // For ICAc: use known and m most confident
       m ← |V^U| · (h/n)                       // estimated labels, gradually increase m
       Y' ← Y^K ∪ {y_i | v_i ∈ V^U ∧ rank(conf_i) ≤ m}
     else if (h = 0) and (AlgType = ICAKn)
       Y' ← Y^K                                // For ICAKn (first iteration): use only known labels
     else                                      // For ICAKn (after first iteration) and ICA: use all
       Y' ← Y^K ∪ {y_i | v_i ∈ V^U}            // labels (known and estimated)
4    for each node v_i ∈ V^U do
       f_i ← calcRelatFeats(V, E, Y')          // Compute feature values, using labels selected above
5    for each node v_i ∈ V^U do                // Re-predict most likely label, using attributes
       (y_i, conf_i) ← M_AR(x_i, f_i)          // and relational features
6  return {y_i | v_i ∈ V^U}                    // Return predicted class label for each node


Figure 2: Algorithm for ICA family of algorithms. We use n = 10 iterations.


the current estimated labels; other labels are considered missing and thus ignored in the next step.
Step 4 computes the relational features using only the committed labels, and step 5 classifies using
this information. Step 3 gradually increases the fraction of estimated labels that are committed per
iteration (e.g., if n=10, from 0%, 10%, 20%,..., up to 100%). Node label assignments committed in
an iteration h are not necessarily committed again in iteration h + 1 (and may in fact change).

ICAc requires some kind of confidence measure (conf_i in Figure 2) to determine the “best” m of
the current label assignments (those with the highest confidence “rank”). We adopt the approach of
Neville and Jensen (2000) and use the posterior probability of the most likely class for each node i as
conf_i. In exploratory experiments, we found that alternative measures (e.g., probability difference
of the top two classes) produced similar results.
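
The label selection performed by ICAc in step 3 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation evaluated in this article: the function name `select_committed_labels`, the dictionary-based representation, and the choice to round m down to an integer are all assumptions.

```python
def select_committed_labels(known, estimated, conf, h, n):
    """Return Y': the known labels plus the m most confident estimated labels.

    known     : dict node -> known label (Y^K)
    estimated : dict node -> current estimated label (nodes in V^U)
    conf      : dict node -> confidence of that estimate (here, the posterior
                probability of the most likely class)
    h, n      : current iteration and total number of iterations
    """
    m = (len(estimated) * h) // n  # m = |V^U| * (h/n), rounded down (assumption)
    ranked = sorted(estimated, key=lambda v: conf[v], reverse=True)
    committed = dict(known)
    for v in ranked[:m]:
        committed[v] = estimated[v]
    return committed


# Hypothetical example: one known node and four nodes with estimated labels.
known = {"k": "A"}
estimated = {"u1": "A", "u2": "B", "u3": "A", "u4": "B"}
conf = {"u1": 0.9, "u2": 0.6, "u3": 0.8, "u4": 0.55}
for h in (0, 5, 10):
    print(h, sorted(select_committed_labels(known, estimated, conf, h, n=10)))
```

With n = 10, iteration h = 0 commits only the known label, h = 5 adds the two most confident estimates, and h = 10 commits every estimate.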

Types of Caution Used: ICAc favors more confident information by ignoring nodes whose labels
are estimated with lower confidence. Step 3 executes this preference, which affects the algorithm
in several ways. First, omitting the estimated labels for some nodes causes the relational feature
value computation in step 4 to ignore those less certain labels. Since this computation favors more
reliable label assignments, subsequent assignments should also be more reliable. Second, if any
node links only to nodes with missing labels, then the computed value of the relational features for
that node will also be missing; Section 6.5 describes how the classifier in step 5 handles this case.
Third, recall that a realistic CC scenario’s test set may have links to nodes with known labels; these
nodes, represented by V^K, provide the “most certain” labels and thus may aid classification. ICAc
exploits only these labels for iteration h = 0. In this case, step 3 ignores all estimated labels (every
estimate for V^U), but step 4 can still compute some relational feature values based on known labels




Gibbs_classify(V, E, X, Y^K, M_AR, M_A, n, n_B, C, AlgType) =
// V=nodes, E=edges, X=attribute vectors, Y^K=labels of known nodes (Y^K = {y_i | v_i ∈ V^K})
// M_AR=local classifier (uses attributes and relations), M_A=classifier that uses only attributes
// n=# of iterations, n_B=“burn-in” iters., C=set of class labels, AlgType=Gibbs or GibbsNC

1  for each node v_i ∈ V^U do                            // Bootstrap: estimate label probs. b_i
     b_i ← M_A(x_i)                                      // for each node, using attributes only
2  for each node v_i ∈ V^U do                            // Initialize statistics
     for each c ∈ C
       stats[i][c] ← 0
3  for h = 1 to n do                                     // Repeat for n iterations
     for each node v_i ∈ V^U do
4      switch (AlgType):
         case (Gibbs):   y_i ← sampleDist(b_i)           // Sample next label from distribution
         case (GibbsNC): y_i ← argmax_{c∈C} b_i(c)       // Or, pick most likely label from dist.
5      if (h > n_B) stats[i][y_i] ← stats[i][y_i] + 1    // Record stats. on chosen label
6    Y' ← Y^K ∪ {y_i | v_i ∈ V^U}                        // Compute feature values, using known
     for each node v_i ∈ V^U do                          // labels and labels chosen in step 4
       f_i ← computeRelatFeatures(V, E, Y')
7    for each node v_i ∈ V^U do                          // Re-estimate label probs., using
       b_i ← M_AR(x_i, f_i)                              // attributes and relational features
8  return stats                                          // Return marginal stats. for each node

Figure 3: Algorithm for Gibbs sampling. Thousands of iterations are typically needed.


from V^K. Thus, the known labels influence the first classification in step 5, before any estimated
labels are used, and in subsequent iterations. Finally, ICAc can also benefit from PLUL.

5.1.3  ICAKn 

The above discussion highlighted two different effects from ICAc: favoring more confident
estimated labels vs. favoring known labels from V^K. An interesting variant is to favor the known labels
in the first iteration (just like ICAc), but then use all labels for subsequent iterations (just like ICA).
We call this algorithm ICAKn (“ICA+Known”).

Types of Caution Used: ICAKn favors only known nodes. It is thus more cautious than ICA, but
less cautious than ICAc. It can also benefit from cautious learning via PLUL.
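
To make the three ICA variants concrete, the following Python sketch follows the steps of Figure 2. It is a minimal sketch under stated assumptions, not the implementation used in our experiments: the relational feature is simply a count of usable neighbor labels, the classifiers are passed in as plain callables, m is rounded down, and the toy chain graph and the toy classifiers `toy_m_a` and `toy_m_ar` are hypothetical.

```python
from collections import Counter


def ica_classify(unknown, neighbors, known, m_a, m_ar, n=10, alg_type="ICA"):
    """Sketch of the ICA family; alg_type is "ICA", "ICAKn", or "ICAc".

    unknown   : list of nodes to label (V^U)
    neighbors : dict node -> list of linked nodes
    known     : dict node -> known label (Y^K)
    m_a       : callable(node) -> (label, confidence), attributes only
    m_ar      : callable(node, feats) -> (label, confidence); feats is a Counter
                of usable neighbor labels, or None if no neighbor label is usable
    """
    est, conf = {}, {}
    for v in unknown:                       # Step 1: bootstrap
        est[v], conf[v] = m_a(v)
    for h in range(n + 1):                  # Step 2: h = 0 .. n
        usable = dict(known)                # Step 3: select labels (Y')
        if alg_type == "ICAc":
            m = (len(unknown) * h) // n     # rounded down (assumption)
            ranked = sorted(unknown, key=lambda v: conf[v], reverse=True)
            usable.update((v, est[v]) for v in ranked[:m])
        elif not (alg_type == "ICAKn" and h == 0):
            usable.update(est)
        feats = {}
        for v in unknown:                   # Step 4: relational features
            labels = [usable[u] for u in neighbors[v] if u in usable]
            feats[v] = Counter(labels) if labels else None
        for v in unknown:                   # Step 5: re-predict
            est[v], conf[v] = m_ar(v, feats[v])
    return est                              # Step 6


# Hypothetical toy problem: chain k - u1 - u2, where k is known to be "A" and
# the attributes of u1 and u2 weakly (and wrongly) suggest "B".
ATTR_P_A = {"u1": 0.45, "u2": 0.45}         # assumed P("A" | attributes)
NEIGHBORS = {"u1": ["k", "u2"], "u2": ["u1"]}
KNOWN = {"k": "A"}


def toy_m_a(v):
    p = ATTR_P_A[v]
    return ("A", p) if p >= 0.5 else ("B", 1 - p)


def toy_m_ar(v, feats):
    p = ATTR_P_A[v]
    if feats is not None:                   # blend attributes with neighbor labels
        p = 0.3 * p + 0.7 * feats["A"] / sum(feats.values())
    return ("A", p) if p >= 0.5 else ("B", 1 - p)


for alg in ("ICA", "ICAKn", "ICAc"):
    print(alg, ica_classify(["u1", "u2"], NEIGHBORS, KNOWN, toy_m_a, toy_m_ar,
                            alg_type=alg))
```

On this toy chain the non-cautious ICA run keeps both bootstrap errors, while the ICAc run first lets the known label correct u1 and only later lets u1 influence u2. This only illustrates the mechanism; it says nothing about how the variants compare on real data.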

5.2  Gibbs  Family  of  Algorithms 

Figure 3 displays pseudocode for Gibbs sampling (Gibbs) and the non-cautious variant GibbsNC.
We describe each in turn.

5.2.1  Gibbs 

In Figure 3, step 1 (bootstrapping) is identical to step 1 of the ICA algorithms, except that for each
node v_i the classifier must output a distribution b_i containing the likelihood of each class, not just




the most likely class. Step 2 initializes the statistics that will be used to compute the marginal class
probabilities for each node. In step 4, within the loop, the algorithm probabilistically samples the
current class label distribution of each node and assigns a single label y_i based on this distribution.
This label is also recorded in the statistics during step 5 (after the first n_B iterations are ignored for
“burn-in”). Step 6 then selects all labels (known labels and those just sampled) and uses them to
compute the relational features’ values. Step 7 re-estimates the posterior class label probabilities
given these relational features. The process then repeats. When the process terminates, the statistics
recorded in step 5 approximate the marginal distribution of class labels, and are returned by step 8.

Types of Caution Used: Like ICAc, Gibbs is cautious in its use of estimated labels, but in a different
way. In particular, ICAc exercises caution in step 3 by ignoring (at least for some iterations) labels
that have lower confidence. In contrast, Gibbs exercises caution by sampling, in step 4, values from
each node’s predicted label distribution, causing nodes with lower prediction confidence to reflect
that uncertainty via higher fluctuation in their assigned labels, yielding less predictive influence on
their neighbors. Gibbs can also benefit from cautious learning via PLUL.

We expect Gibbs to perform better than ICAc, since it makes use of more information, but this
requires careful confirmation. In addition, Gibbs is considerably more time intensive than ICAc or
ICA (see Section 5.6).
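
The steps of Figure 3 can be sketched in Python as shown below. This is a sketch under stated assumptions rather than the code used in our experiments: the classifiers are plain callables returning class-probability dictionaries, the relational input is simply the list of neighbor labels, and the toy chain graph and toy classifiers are hypothetical. Setting `alg_type` to anything other than "Gibbs" gives the arg-max behavior of the non-cautious variant.

```python
import random


def gibbs_classify(unknown, neighbors, known, classes, m_a, m_ar,
                   n=2000, n_burn=200, alg_type="Gibbs", seed=0):
    """Sketch of Gibbs / GibbsNC; returns label counts per node (marginal stats).

    m_a(node) and m_ar(node, neighbor_labels) each return dict class -> prob.
    """
    rng = random.Random(seed)
    b = {v: m_a(v) for v in unknown}                        # Step 1: bootstrap
    stats = {v: {c: 0 for c in classes} for v in unknown}   # Step 2
    y = {}
    for h in range(1, n + 1):                               # Step 3
        for v in unknown:
            if alg_type == "Gibbs":                         # Step 4: sample a label
                weights = [b[v][c] for c in classes]
                y[v] = rng.choices(classes, weights=weights)[0]
            else:                                           # GibbsNC: most likely label
                y[v] = max(classes, key=lambda c: b[v][c])
            if h > n_burn:                                  # Step 5: record after burn-in
                stats[v][y[v]] += 1
        labels = {**known, **y}                             # Step 6: known + chosen labels
        for v in unknown:                                   # Step 7: re-estimate probs.
            b[v] = m_ar(v, [labels[u] for u in neighbors[v]])
    return stats                                            # Step 8


# Hypothetical toy problem: chain k - u1 - u2 with k known to be "A".
ATTR_P_A = {"u1": 0.45, "u2": 0.45}          # assumed P("A" | attributes)
NEIGHBORS = {"u1": ["k", "u2"], "u2": ["u1"]}
KNOWN = {"k": "A"}
CLASSES = ["A", "B"]


def toy_m_a(v):
    return {"A": ATTR_P_A[v], "B": 1 - ATTR_P_A[v]}


def toy_m_ar(v, nbr_labels):
    p = ATTR_P_A[v]
    if nbr_labels:                           # blend attributes with neighbor labels
        p = 0.3 * p + 0.7 * nbr_labels.count("A") / len(nbr_labels)
    return {"A": p, "B": 1 - p}


gibbs_stats = gibbs_classify(["u1", "u2"], NEIGHBORS, KNOWN, CLASSES,
                             toy_m_a, toy_m_ar, n=300, n_burn=100)
nc_stats = gibbs_classify(["u1", "u2"], NEIGHBORS, KNOWN, CLASSES,
                          toy_m_a, toy_m_ar, n=300, n_burn=100, alg_type="GibbsNC")
print(gibbs_stats, nc_stats)
```

Dividing each node's counts by n - n_B gives an estimate of its marginal label distribution. In the arg-max variant every recorded label for a node is the same whenever the assignments stop changing, so the counts collapse onto a single class.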

5.2.2  GibbsNC 

GibbsNC is identical to Gibbs except that instead of sampling in step 4, it always selects the most
likely label. This change makes GibbsNC deterministic (unlike Gibbs), and makes GibbsNC behave
almost identically to ICA. In particular, observe that after any number of iterations h (1 ≤ h ≤ n),
ICA and GibbsNC will have precisely the same set of current label assignments for every node.
However, ICA’s result is the final set of label assignments, whereas GibbsNC’s result is the marginal
statistics computed from these time-varying assignments. For a given data set, if ICA converges to
an unchanging set of label assignments, then for sufficiently large n GibbsNC’s final result (in
terms of accuracy) will be identical to ICA’s. If, however, some nodes’ labels continue to oscillate
with ICA, then ICA and GibbsNC will have different results for some of those nodes.

Types of Caution Used: Just like ICA, GibbsNC uses all available labels for relational feature
computation, and always picks the single most likely label based on the new predictions. Thus,
GibbsNC does not perform cautious inference, though it can benefit from cautious learning to learn
the classifiers M_A and M_AR.

5.3  LBP  Family  of  Algorithms 

This section describes loopy belief propagation (LBP) and the non-cautious variant LBPNC.

5.3.1 LBP

LBP has been a frequently studied technique for performing approximate inference, and has been
used both in early work on CC (Taskar et al., 2002) and in more recent evaluations (Sen and Getoor,
2006; Neville and Jensen, 2007; Sen et al., 2008). Most works that study LBP for CC treat the
entire graph, including attributes, as a pairwise Markov random field (e.g., Sen and Getoor, 2006;
Sen et al., 2008) and then justify LBP as an example of a variational method (cf. Yedidia et al.,
2000). The basic inference algorithm is derived from belief propagation (Pearl, 1988), but applied
to graphs with cycles (McEliece et al., 1998; Murphy et al., 1999).




LBP performs inference via passing messages from node to node. In particular, $m_{i \to j}(c)$
represents node v_i’s assessment of how likely it is that node v_j has a true label of class c. In addition,
$\phi_i(c)$ represents the “non-relational evidence” (e.g., based only on attributes) for v_i having class c,
and $\psi_{ij}(c', c)$ represents the “compatibility function” which describes how likely two nodes of class
c and c' are linked together (in terms of Markov networks, this represents the potential functions
defined by the pairwise cliques of linked class nodes). Given these two sets of functions, Yedidia
et al. (2000) show the belief that node i has class c can be calculated as follows:


$$b_i(c) = \alpha \, \phi_i(c) \prod_{k \in N_i} m_{k \to i}(c) \qquad (1)$$

where $\alpha$ is a normalizing factor to ensure that $\sum_{c \in C} b_i(c) = 1$ and $N_i$ is the neighborhood function
defined as:

$$N_i = \{ v_j \mid \exists (v_i, v_j) \in E \}.$$

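The messages themselves are computed recursively as:

$$m'_{i \to j}(c) = \alpha \sum_{c' \in C} \Big( \phi_i(c') \, \psi_{ij}(c', c) \prod_{k \in N_i \setminus j} m_{k \to i}(c') \Big). \qquad (2)$$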

Observe that the message from i to j incorporates the beliefs of all the neighbors of i ($N_i$) except j
itself. $m'_{i \to j}(c)$ is the “new” value of $m_{i \to j}(c)$ to be used in the next iteration.
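
The belief and message computations can be sketched in Python as follows. This is a minimal sketch of the standard sum-product updates that Equations 1 and 2 describe, not the implementation used in our experiments. The function names, the dictionary layout, the use of a single compatibility table `psi` shared by all links, and the choice to normalize each message so that it sums to one are assumptions made for this example.

```python
from math import prod


def belief(i, classes, phi, messages, nbrs):
    """Equation 1: b_i(c) is proportional to phi_i(c) * prod_k m_{k->i}(c)."""
    raw = {c: phi[i][c] * prod(messages[(k, i)][c] for k in nbrs[i])
           for c in classes}
    z = sum(raw.values())
    return {c: raw[c] / z for c in classes}


def new_message(i, j, classes, phi, psi, messages, nbrs):
    """Equation 2: new message from i to j, excluding what j itself sent to i."""
    raw = {}
    for c in classes:
        raw[c] = sum(
            phi[i][cp] * psi[(cp, c)]
            * prod(messages[(k, i)][cp] for k in nbrs[i] if k != j)
            for cp in classes)
    z = sum(raw.values())
    return {c: raw[c] / z for c in classes}


# Hypothetical two-node example: node 0 is known to be "A", node 1 is unknown.
classes = ["A", "B"]
nbrs = {0: [1], 1: [0]}
phi = {0: {"A": 1.0, "B": 0.0},            # known label: all evidence on "A"
       1: {"A": 0.5, "B": 0.5}}            # uninformative attributes
psi = {("A", "A"): 2.0, ("A", "B"): 1.0,   # same-class links twice as compatible
       ("B", "A"): 1.0, ("B", "B"): 2.0}
messages = {(0, 1): {"A": 0.5, "B": 0.5},  # uniform initial messages
            (1, 0): {"A": 0.5, "B": 0.5}}

messages[(0, 1)] = new_message(0, 1, classes, phi, psi, messages, nbrs)
b1 = belief(1, classes, phi, messages, nbrs)
print(b1)
```

Because node 0 has no neighbor other than node 1, the product in its message is empty and the message reduces to the row of the compatibility table for its known class, so node 1's belief leans toward "A" in proportion to that row.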

For CC, we need a model that generalizes from the training nodes to the test nodes. The above
equations do not provide this, since they have node-specific potential functions (i.e., $\psi_{ij}$ is specific
to nodes i and j). Fortunately, we can represent each potential function as a log-linear combination
of generalizable features, as commonly done for such Markov networks (e.g., Della Pietra et al.,
1997; McCallum et al., 2000a). More specifically for CC, Taskar et al. (2002) used a log-linear
combination of functions that indicate the presence or absence of particular attributes or other
features. Several papers (e.g., Sen and Getoor, 2006; Sen et al., 2008) have described a general model
on how to accomplish this, but do not completely explain how to perform the computation. For a
slight loss in generality (e.g., assuming that our nodes are represented by a simple attribute vector),
we now describe how to perform LBP for CC on an undirected graph. In particular, let $N_A$ be the
number of attributes, $\mathcal{D}_h$ be the domain of attribute h, and $w_{c,h,k}$ be a learned weight indicating how
strongly a value of k for attribute h indicates that a given node has class c. In addition, let $f_i(h,k) = 1$
iff attribute h of node i is k (i.e., $x_{ih} = k$). Then


$$\phi_i(c) = \exp\Big( \sum_{h \in \{1..N_A\}} \sum_{k \in \mathcal{D}_h} w_{c,h,k} \, f_i(h,k) \Big)$$

which is a special case of logistic regression. We likewise define similar learned weights of the
form $w_{c,c'}$ that indicate how likely a node with label c is linked to a node with label c', yielding the
compatibility function

$$\psi_{ij}(c, c') = \exp(w_{c,c'}).$$




LBP_classify(V, E, X, Y^K, w, C, N, AlgType) =
// V=nodes, E=edges, X=attribute vectors, Y^K=labels of known nodes (Y^K = {y_i | v_i ∈ V^K})
// w=learned params., C=set of class labels, N=neighborhood funct., AlgType=LBP or LBPNC

1  for each (v_i, v_j) ∈ E such that v_j ∈ V^U do      // Initialize all messages
     for each c ∈ C do
       if (v_i ∈ V^K)                                  // If class is known (y_i), set message to its
         m_{i→j}(c) ← α · exp(w_{y_i,c})               // final, class-specific value
       else                                            // Otherwise, message starts with same value
         m_{i→j}(c) ← α                                // for every class, but will vary later
2  while (messages are still changing)
3    for each (v_i, v_j) ∈ E such that v_j ∈ V^U do    // Perform message passing
       for each c ∈ C do
         m'_{i→j}(c) ← right-hand side of Equation 2
4    for each (v_i, v_j) ∈ E such that v_j ∈ V^U do
       if (AlgType = LBP)                              // For LBP, copy new messages for use in
         m_{i→j}(c) ← m'_{i→j}(c)                      // next iteration
       else
         c' ← argmax_{c∈C} (m'_{i→j}(c))               // For LBPNC, select most likely label for node
         for each c ∈ C do                             // Treat selected label the same as a “known”
           m_{i→j}(c) ← exp(w_{c',c})                  // label for use in the next iteration
5  for each node v_i ∈ V^U do                          // Compute final beliefs
     for each c ∈ C do
       b_i(c) ← α φ_i(c) Π_{k∈N_i} m_{k→i}(c)
6  return {b_i}                                        // Return final beliefs

Figure 4: Algorithm for loopy belief propagation (LBP). α is a normalization factor.


As desired, the compatibility function is now independent of specific node identifiers, that is, it
depends only upon the class labels c and c', not i and j. We use conjugate gradient descent to learn
the weights (cf. Taskar et al., 2002; Neville and Jensen, 2007; Sen et al., 2008).

Finally, we must consider how to handle messages from nodes with a “known” class label.
Suppose node v_i has known class y_i. This is equivalent to having a node where the non-relational
evidence $\phi_i(c) = 1$ if c is y_i and zero otherwise. Since y_i is known, node v_i is not influenced by its
neighbors. In that case, using Equation 2 (with an empty neighborhood for the product) yields:

$$m_{i \to j}(c) = \alpha \sum_{c' \in C} \phi_i(c') \, \psi_{ij}(c', c) = \alpha \cdot \psi_{ij}(y_i, c) = \alpha \cdot \exp(w_{y_i,c}). \qquad (3)$$

Given these formulas, we can now present the complete algorithm in Figure 4. In step 1, the
messages are initialized, using Equation 3 if v_i is a known node; otherwise, each value is set to α
(creating a uniform distribution). Steps 2-4 perform message passing until convergence, based on
Equation 2. Finally, step 5 computes the final beliefs using Equation 1 and step 6 returns the results.

Types of Caution Used: Like Gibbs, LBP exercises caution by reasoning based on the estimated
label uncertainty, but in a different manner. Instead of sampling from the estimated distribution,
LBP in step 3 directly updates its beliefs using all of its current beliefs, so that the new beliefs
reflect the underlying uncertainty of the old beliefs. In particular, this uncertainty is expressed



wvRN_RL_classify(V, Y^K, n, C, b_prior, N, Γ) =
// V=nodes, Y^K=labels of known nodes (Y^K = {y_i | v_i ∈ V^K}), n=# of iterations
// C=set of class labels, b_prior=class priors, N=neighborhood funct., Γ=decay factor

1 for each node v_i ∈ V^K do                      // Create belief vector for each known label
      b_i ← makeBeliefsFromKnownClass(|C|, y_i)   // (all zeros except at index for class y_i)
  for each node v_i ∈ V^U do                      // Create initial beliefs for unknown labels
      b_i ← b_prior                               // (using class priors as initial setting)
2 for h = 0 to n do                               // Iteratively re-compute beliefs
3     for each node v_i ∈ V^U do                  // Compute new distribution for each node
          b'_i ← (1/|N_i|) Σ_{v_j∈N_i} b_j        // by averaging neighbors' distributions
4     for each node v_i ∈ V^U do                  // Perform simulated annealing
          b_i ← Γ^h · b'_i + (1 − Γ^h) · b_i
5 return {b_i | v_i ∈ V^U}                        // Return belief distribution for each node

Figure 5: Algorithm for wvRN_RL. Based on Macskassy and Provost (2007), we use n = 100
iterations with a decay factor of Γ = 0.99.


by the continuous-valued numbers that represent each message $m_{i \to j}$. LBP can also benefit from
cautious learning with PLUL; in this case, PLUL influences the Wcji,k and Wc^d weights that are
learned (see Section 6.4).

5.3.2 LBP_NC

LBP_NC is identical to LBP except that after the new messages are computed in step 3, in step 4
LBP_NC picks the single most likely label c' to represent the message from $v_i$ to $v_j$. LBP_NC then treats
c' as equivalent to a “known” label $y_i$ for $v_i$ and re-computes the appropriate message $m_{i \to j}(c)$ using
Equation 3.
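
The difference between the two variants in step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `next_message`, the toy weight matrix `w`, and the choice to normalize the hard message so that it sums to one are all assumptions made for this example.

```python
import numpy as np

def next_message(new_msg, w, alg_type="LBP"):
    """Step 4 of Figure 4 for a single edge (v_i, v_j).

    new_msg  : length-|C| sequence, the freshly computed message m'_{i->j}
    w        : |C| x |C| array of compatibility weights w[c', c] (toy values here)
    alg_type : "LBP" keeps the soft message, anything else behaves like LBP_NC
    """
    new_msg = np.asarray(new_msg, dtype=float)
    if alg_type == "LBP":
        # Soft labeling: carry the full distribution into the next iteration.
        return new_msg.copy()
    # LBP_NC: commit to the single most likely label c' ...
    c_prime = int(np.argmax(new_msg))
    # ... and rebuild the message as if c' were a known label (Equation 3).
    msg = np.exp(w[c_prime])
    return msg / msg.sum()  # normalization (assumed role of alpha)

# Hypothetical two-class weights that favor equal labels on linked nodes.
w = np.array([[1.0, 0.0],
              [0.0, 1.0]])
soft = next_message([0.6, 0.4], w, "LBP")
hard = next_message([0.6, 0.4], w, "LBP_NC")
```

In this toy case the soft message keeps the 0.6/0.4 uncertainty, while the hard message depends only on which label won the argmax, so a 0.51/0.49 message would produce exactly the same result as a 0.99/0.01 one.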

Types of Caution Used: Like ICA and Gibbs_NC, LBP_NC is non-cautious because it uses all available
labels for relational feature computation and always picks the single most likely label based on the
new predictions. In essence, the “pick most likely” step transforms the soft-labeling LBP algorithm
into the hard-labeling LBP_NC algorithm, removing cautious inference just as the “pick most likely”
step did for Gibbs_NC. However, LBP_NC, like LBP, can still benefit from cautious learning with
PLUL.

5.4  wvRN  Family  of  Algorithms 

Figure 5 displays pseudocode for wvRN_RL, a soft-labeling algorithm. For simplicity, we present the
related, hard-labeling variants wvRN_ICA+C and wvRN_ICA+NC separately in Figure 6. Each of these is
a relational-only algorithm; Section 7.9 will discuss variants that incorporate attribute information.

5.4.1 wvRN_RL

wvRN_RL (Weighted-Vote Relational Neighbor, with relaxation labeling) is a relational-only CC
algorithm that Macskassy and Provost (2007) argued should be considered as a baseline for all CC




wvRN_ICA_classify(V, Y^K, n, C, b_prior, N, AlgType) =
// V=nodes, Y^K=labels of known nodes (Y^K = {y_i | v_i ∈ V^K}), n=# of iters., C=class labels
// b_prior=class priors, N=neighborhood function, AlgType=wvRN_ICA+C or wvRN_ICA+NC

1 for each node v_i ∈ V^U do
      switch (AlgType):                                 // Set initial value for unknown labels...
          case (wvRN_ICA+C):  y_i ← '?'                 // ...start labels as missing
          case (wvRN_ICA+NC): y_i ← sampleDist(b_prior) // ...or sample label from class priors
2 for h = 0 to n do                                     // Iteratively re-label the nodes
3     for each node v_i ∈ V^U do
          N'_i ← {v_j ∈ N_i | y_j ≠ '?'}                // Find all non-missing neighbors
          if (|N'_i| > 0)                               // New label is the most common label
              y'_i ← argmax_{c∈C} |{v_j ∈ N'_i | y_j = c}|  // amongst those neighbors
          else y'_i ← y_i                               // If no such neighbors, keep same label
4     for each node v_i ∈ V^U do                        // After all new labels are computed,
          y_i ← y'_i                                    // update to store the new labels
5 return {y_i | v_i ∈ V^U}                              // Return est. class label for each node

Figure 6: Algorithm for wvRN_ICA+C and wvRN_ICA+NC. This is a “hard labeling” version of wvRN_RL;
each of the 5 steps corresponds to the same numbered step in Figure 5. We use n = 100
iterations.


evaluations. At each iteration, each node i updates its estimated class distribution by averaging the
current distributions of each of its linked neighbors. wvRN_RL ignores all attributes (non-relational
features). Thus, wvRN_RL is useful only if the test set links to some nodes with known labels to
“seed” the inference process. Macskassy and Provost showed that this simple algorithm can work
well if the nodes exhibit strong homophily and enough labels are known.

Step 1 of wvRN_RL (Figure 5) initializes a belief vector for every node, using the known labels
for nodes in $V^K$, and a class prior distribution for nodes in $V^U$. For each node, step 3 averages the
current distributions of its neighbors, while step 4 performs simulated annealing to ensure
convergence. Step 5 returns the final beliefs. For simplicity, we omit edge weights from the algorithm’s
description, since our experiments do not use them.
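
The steps just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the graph is passed as a plain dictionary of neighbor lists, the names `wvrn_rl`, `neighbors`, and `known` are hypothetical, and a node without neighbors simply keeps its current belief.

```python
import numpy as np

def wvrn_rl(neighbors, known, n_classes, prior, n_iter=100, decay=0.99):
    """Relaxation-labeling sketch of wvRN_RL (unweighted edges).

    neighbors : dict mapping each node to a list of its neighbor nodes
    known     : dict mapping nodes in V^K to their class index
    prior     : class prior distribution used for nodes in V^U
    """
    # Step 1: one-hot beliefs for known nodes, class priors for unknown nodes.
    beliefs = {}
    for v in neighbors:
        if v in known:
            b = np.zeros(n_classes)
            b[known[v]] = 1.0
        else:
            b = np.array(prior, dtype=float)
        beliefs[v] = b
    unknown = [v for v in neighbors if v not in known]

    # Step 2: "for h = 0 to n" in the pseudocode.
    for h in range(n_iter + 1):
        # Step 3: average the current distributions of each node's neighbors.
        averaged = {}
        for v in unknown:
            nbrs = neighbors[v]
            averaged[v] = (np.mean([beliefs[u] for u in nbrs], axis=0)
                           if nbrs else beliefs[v])
        # Step 4: annealing; the weight on the new estimate decays as decay**h.
        g = decay ** h
        for v in unknown:
            beliefs[v] = g * averaged[v] + (1.0 - g) * beliefs[v]

    # Step 5: return beliefs for the unknown nodes only.
    return {v: beliefs[v] for v in unknown}

# Toy chain 0 - 1 - 2 in which only the middle node is unknown.
chain = {0: [1], 1: [0, 2], 2: [1]}
split = wvrn_rl(chain, {0: 0, 2: 1}, n_classes=2, prior=[0.5, 0.5])
agree = wvrn_rl(chain, {0: 0, 2: 0}, n_classes=2, prior=[0.5, 0.5])
```

Because the update always works with full distributions, a node whose known neighbors disagree ends with an evenly split belief rather than an arbitrary hard label.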

Types of Caution Used: Since wvRN_RL computes directly with the estimated label distributions, it
exercises cautious inference in the same manner as LBP. However, unlike the other CC algorithms,
it does not learn from a training set, and thus cautious learning with PLUL does not apply.

5.4.2 wvRN_ICA+C and wvRN_ICA+NC

Figure 6 presents a hard-labeling alternative to wvRN_RL. Each of the five steps mirrors the
corresponding step in the description of wvRN_RL. In particular, for node $v_i$, step 3 computes the most
common label among the neighbors of $v_i$ (the hard-labeling equivalent of averaging the
distributions), and step 4 commits the new labels without annealing.

However, with a hard-labeling algorithm, the initial labels for each node become very important.
The simplest approach would be to initialize every node to have the most common label from the
prior distribution. However, that approach could easily produce interlinked regions of labels




that were incorrect but highly self-consistent, leading to errors even when many known labels were
provided. Instead, Macskassy and Provost (2007) suggest initializing each node $v_i \in V^U$ to missing
(indicated in Figure 6 by a question mark), a value that is ignored during calculations. They call the
resulting algorithm wvRN-ICA; here we refer to it as wvRN_ICA+C. A missing label remains for node
$v_i$ after iteration h if during that iteration every neighbor of $v_i$ was also missing.

Alternatively, a simpler algorithm is to always compute with all neighbor labels (do not initialize
any to missing), but initialize each label in $Y^U$ by sampling from the prior distribution. We call this
algorithm wvRN_ICA+NC. This process is the hard-labeling analogue of wvRN_RL’s approach: instead
of initializing each node with the prior distribution, with wvRN_ICA+NC sampling initializes the entire
set so that it represents, in aggregate, the prior distribution.

Types of Caution Used: wvRN_ICA+NC always uses the estimated label of every node, without regard
for how certain that estimate is. Thus, it does not exhibit cautious inference. However, wvRN_ICA+C
does exhibit cautious inference, although this effect was not discussed by prior work with this
algorithm. In particular, during the first iteration wvRN_ICA+C uses only the certain labels from $Y^K$,
since all nodes in $V^U$ are marked missing. These known labels are used to estimate labels for every
node in $V^U$ that is directly adjacent to some node in $V^K$. In subsequent iterations, wvRN_ICA+C
uses both labels from $Y^K$ and labels from $Y^U$ that have been estimated so far. However, the labels
estimated so far are likely to be more reliable than later estimations, since the former labels are
from nodes that were closer to at least one known label. Thus, in a manner similar to ICA_C’s
gradual commitment of labels based on confidence, wvRN_ICA+C gradually incorporates more and
more estimated labels into its computation, where more confident labels (those closer to known
nodes) are incorporated sooner. This effect causes wvRN_ICA+C to exploit estimated labels more
cautiously.
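
The gradual, outward spread of labels in wvRN_ICA+C can be seen in a small Python sketch. The code below is an assumption-laden illustration, not the authors' implementation: `wvrn_ica`, the dictionary-based graph, the use of `None` for a missing label, and the arbitrary tie-breaking of `Counter.most_common` are all choices made for this example.

```python
import random
from collections import Counter

MISSING = None  # stands in for the '?' label of Figure 6

def wvrn_ica(neighbors, known, classes, prior, cautious=True, n_iter=100, seed=0):
    """Hard-labeling sketch of wvRN_ICA+C (cautious=True) or wvRN_ICA+NC."""
    rng = random.Random(seed)
    labels = dict(known)
    unknown = [v for v in neighbors if v not in known]
    # Step 1: unknown labels start as missing, or are sampled from the prior.
    for v in unknown:
        labels[v] = MISSING if cautious else rng.choices(classes, weights=prior)[0]
    # Step 2: "for h = 0 to n" in the pseudocode.
    for _ in range(n_iter + 1):
        new_labels = {}
        # Step 3: majority vote over the non-missing neighbor labels.
        for v in unknown:
            votes = Counter(labels[u] for u in neighbors[v]
                            if labels[u] is not MISSING)
            # Ties are broken arbitrarily (first label counted wins).
            new_labels[v] = votes.most_common(1)[0][0] if votes else labels[v]
        # Step 4: commit all new labels at once, without annealing.
        labels.update(new_labels)
    # Step 5: return the estimated labels for the unknown nodes.
    return {v: labels[v] for v in unknown}

# Toy chain 0 - 1 - 2 - 3 where only node 0 has a known label.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
after_one = wvrn_ica(chain, {0: "a"}, ["a", "b"], [0.5, 0.5], n_iter=0)
after_all = wvrn_ica(chain, {0: "a"}, ["a", "b"], [0.5, 0.5], n_iter=100)
sampled = wvrn_ica(chain, {0: "a"}, ["a", "b"], [0.5, 0.5], cautious=False)
```

After a single pass only node 1, which is adjacent to the known node, has a label; nodes 2 and 3 remain missing until estimates reach them in later passes. With `cautious=False` every node votes from the first pass onward using its sampled label.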

5.5  Parameter  Learning  for  Uncertain  Labels  (PLUL) 

CC  algorithms  typically  train  a  local  classifier  on  a  fully-labeled  training  set,  then  use  that  local 
classifier  with  some  collective  inference  algorithm  to  classify  the  test  set.  Unfortunately,  this  results 
in  asymmetric  training  and  test  phases:  since  all  labels  are  known  in  the  training  phase,  the  learning 
process  sees  no  uncertainty  in  relational  feature  values,  unlike  the  reality  of  testing.  Moreover, 
the  classifier’s  training  is  unaffected  by  the  type  of  collective  inference  algorithm  used,  and  how 
(if  at  all)  that  collective  algorithm  attempts  to  compensate  for  the  uncertainty  of  estimated  labels 
during  testing.  Consequently,  the  learned  classifier  may  tend  to  produce  poor  estimates  of  important 
parameters  related  to  the  relational  features  (e.g.,  feature  weights,  conditional  probabilities).  Even 
for  CC  algorithms  that  do  not  use  a  local  classifier,  but  instead  take  a  global  approach  that  learns 
over  the  entire  training  graph  (as  with  LBP  and  relaxation  labeling),  the  same  fundamental  problem 
occurs:  if  autocorrelation  is  present,  then  parameters  learned  over  the  fully  labeled  training  set  tend 
to  overstate  the  usefulness  of  relational  features  for  testing,  where  estimated  labels  must  be  used. 

To  address  these  problems,  we  developed  PLUL  (Parameter  Learning  for  Uncertain  Labels). 
PLUL  is  based  on  standard  cross-validation  techniques  for  performing  automated  parameter  tuning 
(e.g.,  Kohavi  and  John,  1997).  The  key  novelty  is  not  in  the  cross-validation  mechanism,  but  in  the 
selection  of  which  parameters  should  be  tuned  and  why.  To  use  PLUL,  we  must  first  select  or  create 
an  appropriate  parameter  that  controls  the  amount  of  impact  that  relational  features  have  on  the 
resultant  classifications.  In  principle,  PLUL  could  search  a  multi-dimensional  parameter  space,  but 
for  tractability  we  select  a  single  parameter  that  affects  all  relational  features.  For  instance,  when 




PLUL_learn(CCtype, P, lp, V_Tr, E_Tr, X_Tr, Y_Tr, V_H, E_H, X_H, Y_H) =
// CCtype=CC alg. to use, P=set of parameter values to consider, lp=labeled proportion to use
// V_Tr, E_Tr, X_Tr, Y_Tr = vertices, edges, attributes, and labels from training graph
// V_H, E_H, X_H, Y_H = vertices, edges, attributes, and labels from holdout graph

1 Y_H^K ← keepSomeLabels(lp, Y_H)     // Randomly select lp% of labels to keep; discard others
2 bestParam ← ∅                       // Initialize variables to track best parameter so far
  bestAcc ← −1
3 for each p ∈ P do                   // Iterate over every parameter value
4     // Learn complete CC classifier from fully-labeled training data, influenced by p
      cc ← learn_CC_classifier(CCtype, V_Tr, E_Tr, X_Tr, Y_Tr, p)
5     // Run CC on holdout graph (with some known labels) and evaluate accuracy
      acc ← execute_CC_inference(cc, V_H, E_H, X_H, Y_H^K)
6     // Remember this parameter if it’s the best so far
      if (acc > bestAcc)
          bestParam ← p
          bestAcc ← acc
7 return bestParam                    // Return best parameter found over the holdout graph


Figure 7: Algorithm for Parameter Learning for Uncertain Labels (PLUL). The holdout graph is
derived from the original training data and is disjoint from the graph that is used later for
testing.


using a k-nearest-neighbor rule as the local classifier, we employ PLUL to adjust the weight $w_R$ of
relational features in the node similarity function. PLUL performs automated tuning by repeatedly
evaluating different values of the selected parameter, as used by the local classifier, together with
the collective inference algorithm (or the entire learned model for LBP). For each parameter value,
accuracy is evaluated on a holdout set (a subset of the training set). PLUL then selects the parameter
value that yields the best accuracy to use for testing.

Figure 7 summarizes these key steps of PLUL and some additional details. First, note that
proper use of PLUL requires a holdout set that reflects the test set conditions. Thus, step 1 of the
algorithm removes some or all of the labels from the holdout set, leaving only the same percentage
of labels (lp%) that are expected in the test set. Second, running CC inference with a new parameter
value may require re-learning the local classifier (for ICA or Gibbs) or the entire learned model (for
LBP). This is shown in step 4 of Figure 7. Alternatively, for Naive Bayes or k-nearest-neighbor
local classifiers, the existing classifier can simply be updated to reflect the new parameter value.
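
The PLUL loop can be sketched in Python as shown below. This is a minimal sketch under explicit assumptions: `learn` and `run_inference` are hypothetical callables standing in for the CC learner and the collective inference routine, labels are stored in dictionaries, and accuracy is measured only on holdout nodes whose labels were hidden (one plausible reading of step 5, not a detail confirmed by the text).

```python
import random

def keep_some_labels(labels, lp, rng):
    """Step 1: keep a random fraction lp of the holdout labels, hide the rest."""
    nodes = sorted(labels)
    kept = rng.sample(nodes, round(lp * len(nodes)))
    return {v: labels[v] for v in kept}

def plul_learn(param_values, lp, train, holdout_labels, learn, run_inference, seed=0):
    """Return the parameter value with the best holdout accuracy.

    learn(train, p)              -> model trained on fully labeled data (hypothetical)
    run_inference(model, known)  -> dict of predicted labels for holdout nodes (hypothetical)
    """
    rng = random.Random(seed)
    known = keep_some_labels(holdout_labels, lp, rng)
    hidden = [v for v in holdout_labels if v not in known]
    best_param, best_acc = None, -1.0                 # step 2
    for p in param_values:                            # step 3
        model = learn(train, p)                       # step 4
        predicted = run_inference(model, known)       # step 5
        correct = sum(predicted[v] == holdout_labels[v] for v in hidden)
        acc = correct / max(len(hidden), 1)
        if acc > best_acc:                            # step 6
            best_param, best_acc = p, acc
    return best_param                                 # step 7

# Stub learner and inference routine, used only to exercise the loop.
truth = {0: "a", 1: "b", 2: "a", 3: "b"}
stub_learn = lambda train, p: p
stub_infer = lambda model, known: {v: (truth[v] if model == 4 else "z") for v in truth}
best = plul_learn([1, 2, 4, 8], 0.5, None, truth, stub_learn, stub_infer)
```

With the stubs above, only the parameter value 4 yields correct predictions, so the loop selects it. In a real setting `learn` would retrain (or update) the classifier for each candidate value, as discussed in the preceding paragraph.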

We expect PLUL’s utility to vary based upon the fraction of known labels (lp) that are available
to the test set. If there are few such labels, there is more discrepancy between the training and test
environments, and hence more need to apply PLUL. However, if there are many such labels, then
PLUL may not be useful.

Because almost all CC algorithms learn parameters based in some way on relational features,
PLUL is widely applicable. In particular, Table 2 shows how we select an appropriate relational
parameter to apply PLUL for different CC algorithms. The top of the table describes how to apply
PLUL to a local classifier that is designed to be used with a CC algorithm like ICA or Gibbs. The




Local Classifier (or CC algorithm)   Parameter set by PLUL (per relational feature)   Values tested by PLUL (default in bold)
Naive Bayes (NB)                     Hyperparameter α for Dirichlet prior             1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
Logistic Regression (LR)             Variance of Gaussian prior                       5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120
k-Nearest Neighbor (kNN)             Weight w_R                                       0.01, 0.03, 0.0625, 0.125, 0.25, 0.5, 0.75, 1.0, 2.0
LBP                                  Variance of Gaussian prior                       5, 10, 20, 100, 200, 1000, 10000, 100000, 1000000

Table 2: The classifiers (NB, LR, and kNN) and CC algorithm (LBP) used in our experiments for
which PLUL can be applied to improve performance. The second column lists the key
relational parameters that we identified for PLUL to learn, while the last column shows
the values that PLUL considers in its cross-validation.


last row demonstrates how it can instead be applied to a global algorithm like LBP. For instance, for
the NB classifier, most previous research has used either no prior or a simple Laplacian (“add one”)
prior for each conditional probability. By instead using a Dirichlet prior (Heckerman, 1999), we can
adjust the “hyperparameter” α of the prior for each relational feature. Larger values of α translate to
less extreme conditional probabilities, thus tempering the impact of relational features. For the kNN
classifier, reducing the weight of relational features has a similar net effect. For the LR classifier
and the LBP algorithm, both techniques involve iterative MAP estimation. Increasing the value
of the variance of the Gaussian prior for relational features causes the corresponding parameter to
“fit” less closely to the training data, again making the algorithm more cautious in its use of such
relational features.
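
The effect of the Dirichlet hyperparameter on NB can be illustrated with a short calculation. The sketch below assumes a symmetric Dirichlet prior in which α acts as a pseudo-count added to every feature value, so that α = 1 matches the “add one” prior mentioned above; the function name `smoothed_prob` and the counts are invented for illustration.

```python
def smoothed_prob(count_value, count_class, n_values, alpha):
    """Estimate P(feature = value | class) under a symmetric Dirichlet prior.

    count_value : training instances of the class that have this feature value
    count_class : all training instances of the class
    n_values    : number of distinct values the feature can take
    alpha       : pseudo-count added to each value (assumed meaning of the hyperparameter)
    """
    return (count_value + alpha) / (count_class + alpha * n_values)

# Hypothetical counts: a binary relational feature seen 9 times out of 10.
p_add_one = smoothed_prob(9, 10, n_values=2, alpha=1)    # 10/12, still extreme
p_cautious = smoothed_prob(9, 10, n_values=2, alpha=64)  # 73/138, much closer to 0.5
```

Raising α pulls the estimate toward the uniform value 1/2, which is the sense in which larger α makes the classifier rely less heavily on a relational feature.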

While the core mechanism of PLUL (cross-validation tuning) is common, techniques like
PLUL to explicitly compensate for the bias incurred from training on a fully-labeled set while
testing using estimated labels have not been previously used for CC. A possible exception is Lu and
Getoor (2003a), who appear to have used a similar technique to tune a relational parameter, but,
in contrast to this work, they did not discuss its need, the specific procedure, or the performance
impact.

PLUL attempts to compensate for the bias incurred from training on the correctly-labeled
training set. Alternatively, Kou and Cohen (2007) describe a “stacked model” that learns based on
estimated, rather than true labels. While the original goal of this stacked approach was to produce a
more time-efficient algorithm, Fast and Jensen (2008) recently demonstrated that this technique, by
eliminating the bias between training and testing, does indeed reduce “inference bias.” This reduced
bias enables the stacked models to perform comparably to Gibbs sampling, even though the stacked
model is a simpler, non-iterative algorithm that consequently has higher learning bias. Interestingly,
Fast and Jensen (2008) note that the stacked model performs an “implicit weighting of local and
relational features,” as with PLUL. The stacked model accomplishes this by varying the learning
and inference procedure, whereas PLUL modifies only the learning procedure, and thus works with
any inference algorithm that relies on a learned model.




5.6  Computational  Complexity  and  the  Cost  of  Caution 

For learning and inference, all of the CC algorithms (variants of ICA, Gibbs, wvRN, and LBP)
use space that is linear in the number of nodes/instances ($N_I$). ICA and Gibbs have significant
similarities, so we consider their time complexity first. For these two algorithms, the dominant
computation costs for inference stem from the time to compute relational features and the time to
classify each node with the local classifier. Typically, nodes are connected to a small number of
other instances, so the first cost is $O(N_I)$ per iteration. For the second cost, the time per iteration is
$O(N_I)$ for NB and LR, and $O(N_I^2)$ for kNN. However, the number of iterations varies significantly.
Based on previous work (Neville and Jensen, 2000; McDowell et al., 2007a), we set n = 10 for
variants of ICA; more iterations did not improve performance. In contrast, Gibbs typically requires
thousands of iterations.

Adding cautious inference to, or removing it from, ICA and Gibbs does not significantly change their
time complexity. In particular, Gibbs_NC has the same complexity as Gibbs. ICA_C introduces an
additional cost, compared to ICA, of $O(N_I \log N_I)$ per iteration to sort the nodes by confidence.
However, in practice classification time usually dominates. Therefore, the overall computational
cost per iteration for all variants of ICA and Gibbs is roughly the same, but the larger number of
iterations for variants of Gibbs makes them much more time-expensive than ICA, ICA_Kn, or ICA_C.

LBP does not explicitly compute relational features, but its main loop iterates over all neighbors
of each node, thus again yielding a cost of $O(N_I)$ per iteration under the same assumptions as
above. We found that LBP inference was comparable in cost to that of ICA, which agrees with
Sen and Getoor (2007). However, training the LBP classifier is much more expensive than training
the other algorithms. ICA and Gibbs only require training the local classifier, which involves zero
to one passes over the data for kNN and NB, and a relatively simple optimization for LR. On the
other hand, training LBP with conjugate gradient requires executing LBP inference many times. We
found this training to be at least an order of magnitude slower than the other algorithms, as also
reported by Sen and Getoor (2007). LBP_NC has the same theoretical and practical time results as
LBP.

wvRN is the simplest CC algorithm, since it requires no feature computation and the key step
of each iteration is a simple average over the neighbors of each node. As with previous algorithms,
assuming a small number of neighbors for each node yields a total time per iteration of $O(N_I)$. Prior
work (Macskassy and Provost, 2007) suggested using a somewhat larger number of iterations (100)
than with ICA. Nonetheless, in practice wvRN’s simplicity makes it the fastest algorithm.

Finally, all of the algorithms, except for wvRN, can be augmented with cautious learning via
PLUL. Executing PLUL requires repeatedly running the CC algorithm with different values of the
selected parameter. We used 9-13 different parameter values, and hence the cost of PLUL vs. not
using PLUL is about one order of magnitude.

6.  Evaluation  Methodology 

This section describes our hypotheses and the method that we use to evaluate them.

6.1  Hypotheses 

Table 3 summarizes our five hypotheses. As described in Section 1, we expect cautious behaviors
to be more important when there is a higher probability of incorrect relational inference. Thus, each




Data characteristic        Type of caution considered   Hypothesis: relative gain of caution will increase as value of characteristic...
Autocorrelation            Inference                    ...increases (H1)
Attribute predictiveness   Inference                    ...decreases (H2)
Link density               Inference                    ...decreases (H3)
Labeled proportion         Inference                    ...decreases (H4)
Labeled proportion         Learning                     ...decreases (H5)

Table  3:  The  five  hypotheses  that  we  investigate. 


hypothesis varies one data characteristic that impacts the likelihood of such errors. In particular,
hypotheses H1-H4 vary a data characteristic to measure the impact of cautious inference, which
Section 7 will evaluate for different pairs of cautious and non-cautious inference algorithms. We
define the “relative gain of cautious inference” as the difference between the accuracies of two such
algorithms (e.g., Gibbs vs. Gibbs_NC). Hypothesis H5 also varies a data characteristic, but does so
to measure the “relative gain of cautious learning” (i.e., comparing performance with vs. without
PLUL).

• H1: The relative gain of cautious inference increases with increasing autocorrelation.
Larger autocorrelation implies that relations are more predictive, and will be learned as such
by the classifier. This magnifies the impact that an error in a predicted label can have on
linked nodes. Therefore, we expect cautious inference algorithms to improve classification
by a greater margin in such cases.

• H2: The relative gain of cautious inference increases with decreasing attribute
predictiveness (ap). Decreased ap implies a greater potential of errors/uncertainty in the predicted
labels. The effect of cautiously using uncertain labels should be greater in such cases.

• H3: The relative gain of cautious inference increases with decreasing link density (ld).
When the number of links is high, a single mispredicted label has relatively little impact on
its neighbors. As the number of links decreases, however, a single misprediction can cause
larger relational feature uncertainty, increasing the need for caution.

• H4: The relative gain of cautious inference increases with decreasing labeled proportion
(lp). When lp is high, only a few of each node’s neighbors have estimated labels (most are
known with certainty). Consequently, there is less uncertainty in relational feature values, and
less need to use estimated labels cautiously.

• H5: The relative gain of cautious learning with PLUL increases with decreasing labeled
proportion (lp). As with H4, when lp is high there is less uncertainty in the relational features.
Thus there is less disparity between the fully correct training set (where classifier parameters
were learned) and the test set. Consequently, we expect PLUL, which compensates for any
such disparity, to matter less when lp is high.

6.2  Tasks 

We  will  evaluate  three  general  tasks  (see  Section  2.3): 




Parameter                  Abbrev.   Values tested (defaults in bold)
Nodes per graph            N_I       250
Number of class labels     N_C       5
Number of attributes       N_A       10
Degree of homophily        dh        0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Link density               ld        0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Attribute predictiveness   ap        0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
Labeled proportion         lp        0%, 10%, 20%, 40%, 50%, 60%, 80%

Table 4: Synthetic data parameters. Defaults were chosen based on averages from Cora and
Citeseer, two commonly studied data sets for CC.


1. Out-of-sample task: Here the test set does not contain or link to any known nodes, as with
Neville and Jensen (2000), Taskar et al. (2002), and Sen and Getoor (2006).

2. Sparse in-sample task: Here some of the test nodes, but only a few, have known labels
(we use 10%). We focus particularly on this task, because some researchers argue that it
is the most realistic scenario, since often networks are large, and acquiring known labels is
expensive (Bilgic and Getoor, 2008). This was the primary scenario considered by the recent
work of McDowell et al. (2007a,b), Bilgic and Getoor (2008), and Gallagher et al. (2008).

3. Dense in-sample task: Here a substantial number of test nodes may have known labels (we
use 50%). This task was the one recently evaluated by Sen et al. (2008).

6.3  Data 

We  evaluate  the  hypotheses  over  both  synthetic  and  real-world  data  sets,  which  we  describe  below. 
We  use  the  synthetic  data  to  highlight  how  different  data  characteristics  affect  the  relative  gain  of 
cautious  behaviors,  then  the  real-world  data  sets  to  validate  these  findings. 

6.3.1  Synthetic  Data 

We use a synthetic data generator (see Table 4) with two components: a Graph Generator and an
Attribute Generator. The Graph Generator has four inputs: $N_I$ (the number of nodes/instances), $N_C$
(the number of classes), ld (the link density), and dh (the degree of homophily). For each link,
dh controls the probability that the linked nodes have the same class label; higher values yield
higher autocorrelation (see Appendix A for details). The final number of links is approximately
$N_I/(1 - ld)$, and the final link degrees follow a power law distribution, which is common in real
networks (Bollobás et al., 2003). The Graph Generator is identical to that used by Sen et al. (2008);
see that article for more detail.
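
As a rough worked example of the link-count approximation $N_I/(1 - ld)$, take $N_I = 250$ nodes and two link densities from the tested range. These two values are chosen only to show the arithmetic and are not claimed to be the defaults:

```latex
\[
\frac{N_I}{1 - ld}\bigg|_{ld = 0.2} = \frac{250}{0.8} = 312.5 \approx 313 \text{ links},
\qquad
\frac{N_I}{1 - ld}\bigg|_{ld = 0.9} = \frac{250}{0.1} = 2500 \text{ links}.
\]
```

The count grows slowly at low densities and sharply as ld approaches 1, so equal steps in ld do not correspond to equal steps in the number of links.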

To make this a practical study, we chose default parameter values that mimic characteristics of
two frequently studied CC data sets, Cora and Citeseer (McDowell et al., 2007a; Neville and Jensen,
2007; Sen et al., 2008). In particular, we use $N_C$=5 classes, and Table 4 shows additional default values. We
chose $N_I$=250 nodes, a smaller value than with Cora/Citeseer, to reduce CC execution time, but
larger values did not change the performance trends.




The Attribute Generator generates 10 ($N_A$) binary attributes. Our design for it is motivated by our
observations of common CC data sets. We found that, unlike synthetic models used in prior studies,
different attributes vary in their utility for class prediction. To simulate this, we associate each
attribute h with a particular class $c_m$, where $m = h \bmod N_C$, and vary the strength of each attribute’s
predictiveness based on the value of h. In particular, for node $v_i$ with class $y_i$, the probability that
$v_i$’s h-th attribute $x_{ih}$ has value 1 depends upon the class $y_i$ as follows:

$$P(x_{ih} = 1 \mid y_i = c_k) =
\begin{cases}
0.15 + (ap - 0.15)\cdot\frac{h}{N_A - 1} & \text{if } k = h \bmod N_C \\
0.1 & \text{if } k = (h - 1) \bmod N_C \\
0.05 & \text{if } k = (h + 1) \bmod N_C \\
0.02 & \text{otherwise.}
\end{cases}$$

The first line indicates that, when $y_i$ (= $c_k$) is the class associated with attribute h (i.e., $k =
h \bmod N_C$), then $P(x_{ih} = 1 \mid y_i = c_k)$ ranges from 0.15 for h = 0 to ap (a constant representing the
strength of attribute predictiveness) for h = 9. As a result, each of the five classes has two attributes
associated with that class, but some classes have associated attributes that are more useful for
prediction. However, $x_{ih}$ may also be 1 when $y_i$ is some other class besides an “associated class”; the
next three lines encode this class ambiguity. This ambiguity/noise is based on our observations of
Cora and Citeseer and is similar to the binomial distribution used by Sen et al. (2008).
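
The attribute model can be sketched in Python as follows. The sketch assumes that the probability for the associated class grows linearly in h, from 0.15 at h = 0 to ap at h = $N_A - 1$, which matches the stated endpoints but is otherwise an assumption; the names `attr_prob` and `generate_attributes` are hypothetical.

```python
import random

def attr_prob(h, k, ap, n_classes=5, n_attrs=10):
    """P(x_ih = 1 | y_i = c_k) for attribute index h and class index k."""
    if k == h % n_classes:
        # Associated class: linear ramp from 0.15 (h = 0) to ap (h = n_attrs - 1).
        return 0.15 + (ap - 0.15) * h / (n_attrs - 1)
    if k == (h - 1) % n_classes:
        return 0.1
    if k == (h + 1) % n_classes:
        return 0.05
    return 0.02

def generate_attributes(k, ap, rng, n_classes=5, n_attrs=10):
    """Sample the binary attribute vector for one node whose class index is k."""
    return [1 if rng.random() < attr_prob(h, k, ap, n_classes, n_attrs) else 0
            for h in range(n_attrs)]

rng = random.Random(0)
x = generate_attributes(k=2, ap=0.6, rng=rng)  # ap = 0.6 is an illustrative value
```

Python's `%` operator returns a non-negative result for a positive modulus, so `(h - 1) % n_classes` wraps correctly when h = 0.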

Finally, we use a parameter for test set generation called lp (labeled proportion), which is the proportion of test nodes with known labels. We use default values of lp=0%, lp=10%, and lp=50% for the three tasks defined in Section 6.2. Nodes to be labeled are selected uniformly at random from the test set until the desired value of lp is reached. In contrast, some real data sets are likely to exhibit non-uniform clustering of known nodes. We conjecture that such data sets will have a smaller “effective” lp, since each known node will have, on average, fewer direct connections to unknown nodes. For instance, a data set with lp=10% may behave more like a data set with lp=5% where the labels are more uniformly distributed. Such effects should be examined in future work.
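Uniform selection of the known test nodes can be sketched as follows. The function name is hypothetical, and rounding lp times the test-set size to the nearest integer is an assumption; the text only says that nodes are drawn until the desired lp is reached.

```python
import random

def choose_known_nodes(test_nodes, lp, rng):
    """Pick a uniformly random subset of test nodes whose labels are revealed."""
    nodes = list(test_nodes)
    n_known = round(lp * len(nodes))  # assumed rounding rule
    return set(rng.sample(nodes, n_known))

# Example: a 250-node test set for the sparse in-sample task (lp = 10%).
known = choose_known_nodes(range(250), 0.10, random.Random(1))
```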


6.3.2 “Real-world” Data Sets

We consider the following five “real-world” data sets (see Table 5). “Real-world” is a somewhat subjective term; however, all of the data sets are based on naturally arising networks and have been used in some form for previous research on relational learning.


1. Cora (McCallum et al., 2000b): A collection of machine learning publications categorized into seven classes. The relational links are (directed) citations.

2. CiteSeer (Lu and Getoor, 2003a): A collection of research publications drawn from CiteSeer. The relational links are (directed) citations.

3. WebKB (Craven et al., 1998): A collection of web pages from four computer science departments categorized into six classes (Faculty, Student, Staff, Course, ResearchProject, or Other). “Other” is problematic because it is too general, representing 74% of the pages. Like Taskar et al. (2002), we discarded all “Other” pages that did not have at least three outgoing links, yielding a total of 1541 instances of which 30% are Other. The relational links are the (directed) hyperlinks among these pages.




| | Cora | CiteSeer | WebKB | HepTH | Terror |
|---|---|---|---|---|---|
| Characteristics of entire graph | | | | | |
| Instances/nodes | 2708 | 3312 | 1541 | 2194 | 645 |
| Attributes (non-relat. feats.) available | 1433 | 3703 | 100 | 387 | 106 |
| Attributes used (max) | 100 | 100 | 100 | 100 | 100 |
| Attributes used (default) | 20 | 20 | 40 | 40 | 2 |
| Link/relation directedness | directed | directed | directed | directed | undirected |
| Type of relational features used | in,out | in,out | in,out,co | in,out | linksto |
| Class labels | 7 | 6 | 6 | 7 | 6 |
| Total relational features used | 14 | 12 | 18 | 14 | 6 |
| Links per node | 3.9 | 2.7 | 5.8 (64.6) | 8.9 | 9.8 |
| Autocorrelation | 0.88 | 0.83 | 0.30 (0.53) | 0.54 | 0.16 |
| Characteristics of each test set (on average) | | | | | |
| Instances/nodes | 400 | 400 | 335-469 | 300 | 150 |
| Number of folds | 5 | 5 | 4 | 5 | 3 |
| Links per node | 2.7 | 2.7 | 5.7 (61.0) | 4.3 | 12.3 |
| Approx. link density | 0.23 | 0.23 | 0.64 (0.97) | 0.53 | 0.79 |
| Autocorrelation | 0.85 | 0.84 | 0.38 (0.53) | 0.64 | 0.24 |
| Label consistency | 0.78 | 0.75 | 0.21 (0.90) | 0.61 | 0.56 |
| Approximate homophily | 0.74 | 0.70 | 0.05 (0.88) | 0.54 | 0.47 |

Table 5: Summary of the five real-world data sets used. in and out features compute separate values based on incoming or outgoing links, while linksto features make no such distinction. co features are based on virtual co-citation links; nodes A and B are linked via a co link if there exists some node C with outgoing links to both A and B. For WebKB, the first statistic listed is computed ignoring co-links, while the statistic in parentheses is computed using only co-links. Label consistency is the percentage of links connecting nodes with the same label; Appendix A defines this and approximate homophily. Section 6.9 describes the “default” number of attributes used.


4. HepTH: A collection of journal articles in the field of theoretical high-energy physics, derived from the Proximity Hep-Th database (http://kdl.cs.umass.edu/data/hepth). The original data set did not have any single class label, but some pages were classified into topic subtypes. Among pages with one such subtype, we selected all articles belonging to the six most common subtypes, yielding 1404 articles. To create a more connected graph, we also selected all articles with a date after 2001 that linked to at least two of the 1404 pre-selected articles. There were 790 such articles, which we treated as having a class label of “Other.” The relational links are the (directed) citations among all 2194 articles.

5. Terror (Zhao et al., 2006): A collection of terrorist incidents, drawn from the Profile in Terror project (http://profilesinterror.mindswap.org). The incidents are non-uniformly distributed into six categories: Bombing (44%), WeaponAttack (38%), Kidnapping (14%), Arson (2%), NBCRAttack (1%), and OtherAttack (1%). The relational links indicate (undirected) geographical co-location.

These data sets are intended to demonstrate CC performance on a range of data characteristics. For instance, CC would be expected to be very helpful for Cora and CiteSeer, where autocorrelation is high, but not very helpful for Terror.




6.4  CC  Algorithms 

We evaluate the ten algorithms listed in Table 1, plus Content Only (CO), a non-relational baseline that uses only attributes. For each of the four main sections in Table 1, there is one non-cautious variant (ICA, Gibbs_NC, LBP_NC, and wvRN_ICA+NC) and one or two cautious variants (ICA_C, ICA_Kn, Gibbs, LBP, wvRN_RL, and wvRN_ICA+C). The wvRN algorithms also serve as a collective, relational-only baseline.

Based on previous work (Neville and Jensen, 2000; McDowell et al., 2007a), the ICA-based algorithms used n = 10 iterations; more iterations did not improve performance. For Gibbs, we used 1500 iterations, with a random restart every 300 iterations, and ignored the first 100 iterations after a restart for burn-in. Additional iterations did not improve performance. Gibbs_NC converged in far fewer iterations because it does not sample and is deterministic; we used n = 50.
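To make the Gibbs schedule concrete, the following sketch lists which iterations would contribute samples under the settings above. The helper name is hypothetical, and it assumes iterations are numbered from 0 with the first restart at iteration 0.

```python
def retained_iterations(total=1500, restart_every=300, burn_in=100):
    """Return the iteration indices whose samples are kept after burn-in."""
    return [t for t in range(total) if t % restart_every >= burn_in]

kept = retained_iterations()
```

Under these assumptions, each of the five restart segments discards 100 iterations and keeps 200, so 1000 of the 1500 iterations contribute to the final label estimates.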

For LBP, we assumed that each parameter was a priori independent and had a zero-mean Gaussian prior with a default uniform prior variance of σ² = 10, which is similar to the values reported in previous work (e.g., Sen and Getoor 2006; Neville and Jensen 2007). We used MAP estimation, based on conjugate gradient, to estimate these parameters. σ² controls how tightly the parameters fit to the training data; Table 2 shows the alternative values of σ² considered by PLUL to constrain this fitting for the relational parameters.

6.5  Classifiers 

To account for possible variations in overall CC performance trends due to the effect of the underlying classifier, we tested three local classifiers with each CC algorithm wherever applicable (this excludes LBP and wvRN). Section 5.5 already described, for each classifier, the key relational feature whose value is learned by PLUL; we now provide more detail on each classifier and its application of PLUL.

The first classifier is Naive Bayes (NB). PLUL was used to learn α for the Dirichlet prior of each relational feature. The second classifier is Logistic Regression (LR). We used MAP estimation with Gaussian priors to learn the parameters for LR; PLUL learned an appropriate variance for the prior of each relational feature. The final classifier is k-Nearest Neighbor (kNN); we used k=11. When computing similarity, attributes were assigned a weight of 1. PLUL learned the weight w_r for each relational feature. Weighted similarity was used for voting.

For each classifier, Table 2 shows the specific values considered by PLUL. The “default” value shown (e.g., α = 1.0 for NB) was used in two ways. First, the default was used as the parameter value for all attributes. Second, the default was used as a manual setting for the parameter value for all relational features when PLUL is not being used. When PLUL was used, the learned value was used instead for the relational features.

The ICA_C algorithm requires a classifier that can ignore missing relational feature values. kNN and NB can do this easily: kNN by dropping the feature from the similarity calculation and NB by skipping the feature in probability computation. For LR, however, dealing with missing values is a current research topic (e.g., Fung and Wrobel 1989), with typical techniques including mean value substitution or multiple imputation. However, for CC the situation is less complex than the more general case, because missing values occur only for the test set, only for relational features, and typically only when all neighbors of a node have missing labels. Thus, we can learn several LR classifiers: one that uses all relational features, and one for each combination of features that may be missing simultaneously (for our data, this is at most 4). Experimentally, we found this to perform better than mean value substitution, though the difference was slight because missing values were rare. These results are consistent with those of Saar-Tsechansky and Provost (2007) on non-relational data. Section 7.8 discusses this effect in more detail.

6.6  Node  Representation 

Each  node  is  represented  by  a  set  of  (non-relational)  attributes  and  relational  features.  Algorithms 
based  on  LBP  and  wvRN  reason  directly  with  each  individual  link,  and  their  algorithms  thus  di¬ 
rectly  define  the  effective  relational  features  used.  Approaches  based  on  ICA  and  Gibbs,  however, 
use  some  kind  of  aggregation  function  to  compute  their  relational  feature  values.  We  first  describe 
the  possible  aggregation  functions  for  these  features,  then  separately  describe  the  complete  repre¬ 
sentation  for  the  synthetic  and  real  data. 

6.6.1  Relational  Features  Considered 

We  considered  three  different  types  of  relational  features: 

• Count: This type represents the number of neighbors that belong to a particular class. For each node i, there is one such feature f_i(c) per class label c. Its value is f_i(c) = Neighbors_i(c), which is the number of nodes linked to node i that have a known or current estimated label of c. For instance, in step C of Figure 1, f_2(P) = 1 and f_2(S) = 2.

• Proportion: This feature is like “count”, except that the feature value represents the proportion of neighbors that have a particular label, rather than the raw number of such neighbors. For this feature, f_i(c) = Neighbors_i(c) / Neighbors_i(*), where Neighbors_i(*) is the number of nodes linked to node i that have any current label (known or estimated, but, for ICA_C, excluding those nodes whose label was set to missing because of low confidence). If Neighbors_i(*) is zero, then f_i(c) is set to missing. For example, if proportion features were being used, then the feature values for step C of Figure 1 would be f_2(P) = 1/3 and f_2(S) = 2/3.

• Multiset: Proportion and count features aggregate the labels of a node’s neighborhood to produce a single numerical value for each possible label. During inference, this aggregate value is then compared against the mean value from the training set (with NB or LR), or compared against the aggregate values for nodes in the training set (with kNN). In contrast, a “multiset” feature uses a single multiset to represent the current labels of a node’s neighbors. For instance, if multiset features were used, then for step C of Figure 1, f_2 = {P, S, S}. This has the same information content as with count features, but can be exploited differently by some local classifiers. In particular, during NB inference, each label in the multiset (excluding missing labels) is separately used to update the conditional probability that a node has true label c. This is the “independent value” approach introduced by Neville et al. (2003b) and used by Neville and Jensen (2007). However, this approach does not directly apply to LR or kNN.
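The three feature types can be computed from the same list of neighbor labels, as the following sketch illustrates. The function name and the use of None for a missing label are assumptions made for this example only.

```python
from collections import Counter

def relational_features(neighbor_labels, classes):
    """Compute count, proportion, and multiset features for one node.

    neighbor_labels holds the known or currently estimated label of each
    neighbor; None marks a label treated as missing (an assumed encoding).
    """
    current = [label for label in neighbor_labels if label is not None]
    multiset = Counter(current)
    count = {c: multiset[c] for c in classes}
    if current:
        proportion = {c: multiset[c] / len(current) for c in classes}
    else:
        # No neighbor has a usable label, so the proportion is missing.
        proportion = {c: None for c in classes}
    return count, proportion, multiset

# Node 2 in step C of Figure 1 has one neighbor labeled P and two labeled S.
count, proportion, multiset = relational_features(["P", "S", "S"], ["P", "S"])
```

Here count is {P: 1, S: 2} and proportion is {P: 1/3, S: 2/3}, matching the worked values in the list above, while the multiset keeps the individual labels available for NB to consume one at a time.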

6.6.2  Synthetic  Data  Node  Representation 

Each node is represented by ten binary attributes and some relational features. Because representation choices can affect how well a CC algorithm handles the uncertainty of estimated labels, for each local classifier-based algorithm we considered count and proportion relational features, as well as multiset features when using NB. For each trial, we evaluated the two or three possible types of relational features with cross-validation (evaluating accuracy on the holdout set), then selected the feature type with the highest accuracy to use for testing. When PLUL was used, PLUL was also applied to each feature type; the best performance (on the holdout set) reported by PLUL for each feature type was then used for this feature selection. Section 7.8 describes which feature types were chosen most often for each local classifier. Since there are 5 class labels for the synthetic data and links are undirected, there were 5 relational features when using count or proportion features, and 1 relational feature (whose value is a multiset) when using multiset.

6.6.3  Real-world  Data  Node  Representation 

For all five data sets we used binary attributes that indicated the presence or absence of a particular word. For WebKB, these words were from the body of each HTML page; we selected the 100 most frequent such words, which was all that was available in our version of the data set. For symmetry, and because adding more words had a small impact on performance, we likewise set up the remaining data sets to select 100 words as attributes. For Cora and CiteSeer, these words were taken from the body of the publications; as with previous work (McDowell et al., 2007a) we selected the 100 words with the highest information gain in the training set to use. For Terror, the words come from hand-written descriptions of each incident provided with the data set; we selected the first 100 of the 106 available attributes. For HepTH, we selected, based on information gain, the 100 highest-scoring words from the article title or the name of the corresponding journal.

For relational features, we again considered the proportion, multiset, and count features, and selected the best feature type using cross-validation as described above. All of the data sets except Terror had directed links. For these data sets, we computed separate relational feature values based on incoming and outgoing links. In addition, previous work has shown WebKB to have much stronger autocorrelation based on co-citation links than on direct links (see Table 5). However, using such links can sometimes be problematic. Thus, we evaluate two data sets: “WebKB” and “WebKB+co”. For WebKB, algorithms use in and out links (“direct” links). For WebKB+co, algorithms use in, out, and co-links, except wvRN uses only co-links, as suggested by Macskassy and Provost (2007) (see Section 7.6).

6.7  Training/Test  Splits  Generation 

For  the  synthetic  data,  we  generate  training,  holdout,  and  test  graphs  that  are  disjoint.  Likewise,  for 
WebKB,  the  data  was  already  divided  into  four  splits  (one  for  each  department)  that  can  be  used  for 
cross-validation. 

For the other real data sets, we must manually construct training and test splits from the original graph. Sen et al. (2008) suggest a technique based on snowball sampling that involves picking a random starting node and iteratively growing a split around that node, where the class of the next node to be selected is sampled from the overall class distribution. However, we found that low graph connectivity often prevented the algorithm from producing a final subgraph whose class distribution resembled the whole graph’s. Instead, we created the following technique, similarity-driven snowball sampling: given the whole graph G, pick a random starting node and add it to the split G_i. At each step, consider the frontier F of G_i (all those nodes not in G_i that link to some node in G_i). Among all labels c that exist in F, select the class label c' such that adding some node of label c' to G_i would maximize the similarity (inverse Euclidean distance) of the class distributions of G_i and G. Given this c', randomly select some node in F of class c' and add it to G_i. Repeat this selection and insertion until G_i is of the desired size.
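A single split of similarity-driven snowball sampling might be sketched as follows. This is an illustrative reading of the description above, not the code used in our experiments: the function names are hypothetical, the graph is assumed to be given as a symmetric adjacency dictionary (so link direction is ignored when forming the frontier), and maximizing inverse Euclidean distance is implemented as minimizing the distance.

```python
import random
from collections import Counter

def distribution(counts, classes):
    """Normalize per-class counts into a class distribution."""
    total = sum(counts[c] for c in classes)
    return [counts[c] / total for c in classes]

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def snowball_split(adj, labels, size, rng):
    """Grow one split whose class distribution tracks the whole graph's."""
    classes = sorted(set(labels.values()))
    target = distribution(Counter(labels.values()), classes)
    start = rng.choice(sorted(adj))
    split = {start}
    counts = Counter([labels[start]])
    while len(split) < size:
        # Frontier: nodes outside the split linked to some node inside it.
        frontier = {u for v in split for u in adj[v]} - split
        if not frontier:
            break  # the connected component is exhausted

        def distance_if_added(c):
            trial = counts.copy()
            trial[c] += 1
            return euclidean(distribution(trial, classes), target)

        # Choose the frontier label that brings the split closest to G.
        best = min(sorted({labels[u] for u in frontier}), key=distance_if_added)
        node = rng.choice(sorted(u for u in frontier if labels[u] == best))
        split.add(node)
        counts[labels[node]] += 1
    return split
```

The early exit when the frontier is empty is an addition for robustness; the text does not say how an exhausted component is handled.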

We run this algorithm in parallel for Ns different subgraphs, using Ns different seeds, and permit each node to be inserted into only one subgraph. This results in Ns disjoint splits that have similar class distributions and that can be used for Ns-fold cross validation. We set Ns = 5 for Cora, CiteSeer, and HepTH, and Ns = 3 for the smaller Terror.

Table  5  shows  some  of  the  characteristics  of  the  generated  test  sets  vs.  the  original,  complete 
graphs.  In  general,  the  autocorrelation  and  number  of  links  per  node  are  similar,  indicating  that  the 
sampling  procedure  did  not  dramatically  change  the  average  characteristics  of  the  graph.  While  the 
splitting  procedure  effectively  removes  links,  the  average  degree  of  the  test  sets  may  still  be  greater 
than  with  the  original  graph  if  high-degree  subsets  of  the  original  are  selected. 

6.8  Test  Procedure 

We first consider the synthetic data. For each control condition (i.e., data generated with a combination of dh, ap, ld, and lp values; see Table 4) we ran 25 random trials. For each trial, we generated training, holdout, and test data sets of 250 nodes each. All training is performed on the fully labeled training set. The holdout set, when not used for PLUL, was merged with the training set. We measured classification accuracy on the test set, excluding all nodes with “known” labels.

For the real-world data sets, each experiment involves using all of the relational features shown in Table 5 and a fixed number of attributes (Na). We vary Na from 2 to 100 (recall that for all data sets 100 attributes were selected for experimentation). For each setting of Na, we perform Ns-fold cross-validation, where Ns is 3, 4, or 5, depending on the data set. Each one of these 3 to 5 trials is associated with one subgraph (the test set), and the remaining 2-4 subgraphs comprise the training set. We then apply PLUL by training on half of the training set and using the other half as the holdout set. After PLUL selects the best parameter setting, we re-train on the whole training set and evaluate accuracy on the test set. If PLUL is not used, training likewise uses the whole training set.

We report results with accuracy in order to ease comprehension of the results and to facilitate comparison with some of the most relevant related work (e.g., Sen et al., 2008; Macskassy and Provost, 2007). Results with area under the ROC curve (AUC) for the majority class demonstrated similar trends.

6.9  Statistical  Analysis 

We conducted two distinct types of analysis. First, to compare algorithms for a single control condition, we used a one-tailed paired t-test accepted at the 95% confidence level. For every such test each “test point” is the accuracy over a single trial’s test graph. For example, for the synthetic data there are 25 trials for each control condition, and thus a single t-test compares 25 pairs of accuracies (e.g., ICA_C vs. ICA). In all cases the test graphs used by these t-tests are disjoint, for both the synthetic and the real data.
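The paired comparison can be sketched with the standard library alone. The function name is hypothetical; the critical value 1.711 is the standard one-tailed t-table entry for a 5% significance level with 24 degrees of freedom (25 pairs), and the accuracies in the usage lines are made up for illustration.

```python
import math
import statistics

# One-tailed critical value for alpha = 0.05 with 24 degrees of freedom,
# i.e., the 25 paired trials of one synthetic control condition.
T_CRIT_DF24 = 1.711

def paired_t_statistic(acc_a, acc_b):
    """t statistic for the mean per-trial accuracy difference (a minus b)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return mean / (sd / math.sqrt(len(diffs)))

# Tiny made-up example with three trials (so T_CRIT_DF24 does not apply here).
t = paired_t_statistic([0.8, 0.7, 0.9], [0.7, 0.6, 0.7])
```

With 25 trials, algorithm a would be judged significantly better than algorithm b when the statistic exceeds T_CRIT_DF24.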

Second, we performed linear regression slope tests. In particular, for hypotheses H1-H4, we compared two algorithms (e.g., ICA_C vs. ICA) for each independent variable X (e.g., ld) as follows: For each trial, we computed the difference in the algorithms’ classification accuracies (e.g., for the synthetic data, 225 such differences for 25 trials and 9 values of ld). We performed linear regression (Y = a + bX), where the accuracy difference is the dependent variable Y and X is the independent variable (e.g., ld). The estimated value of slope b, when non-zero, indicates an increasing (+) or decreasing (-) trend. Regression produces a p value associated with the slope that indicates the significance level for hypothesis testing; we accept when p < 0.05. For hypothesis H5, the equations are the same but we compare a single CC algorithm with and without PLUL.
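The slope test rests on an ordinary least-squares fit, which the following sketch spells out. The function names are hypothetical and the data in the usage lines are invented; converting the slope's t statistic into a p value additionally requires the t distribution with n - 2 degrees of freedom, which is left to a statistics package.

```python
import math

def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def slope_t_statistic(xs, ys):
    """t statistic for the null hypothesis that the slope b is zero."""
    a, b = fit_line(xs, ys)
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    return b / standard_error

# Invented accuracy differences (Y) at four values of an independent variable (X).
a, b = fit_line([0, 1, 2, 3], [0, 1, 1, 2])
t_slope = slope_t_statistic([0, 1, 2, 3], [0, 1, 1, 2])
```

A positive b means the cautious algorithm's advantage grows as X increases, which is the direction predicted by, for example, H1.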

For the synthetic data, the analysis is straightforward and we use the data generation parameters dh, ap, ld, and lp as the independent variable for regression. Analysis for the real data sets requires more explanation. For instance, each computed subgraph of a data set has similar autocorrelation, so regression for H1 (where autocorrelation is the X value) cannot be performed on a single data set. Instead, we combine the trials of all the real data sets into one analysis, where the independent variable is the measured autocorrelation of the corresponding data set (we include WebKB, but exclude WebKB+co because it’s not clear how to compute its autocorrelation with direct links combined with co-citation links). In addition, our results show that when attribute predictiveness is high, there is less need for caution. Thus, to prevent any interactions between autocorrelation and caution from being obscured by high attribute predictiveness, we use fewer than 100 attributes for these experiments. In particular, for each data set we evaluated the baseline CO algorithm with varying numbers of attributes Na, and selected the number that yields an average accuracy closest to 50%. Table 5 shows the resulting default number of attributes for each data set.
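Selecting the default number of attributes amounts to a nearest-to-target search, sketched below. The function name and the accuracy values are hypothetical; they are not measurements from our experiments.

```python
def default_num_attributes(co_accuracy_by_na, target=0.5):
    """Return the Na whose average CO accuracy is closest to the target."""
    return min(co_accuracy_by_na,
               key=lambda na: abs(co_accuracy_by_na[na] - target))

# Hypothetical average CO accuracies for four candidate values of Na.
chosen = default_num_attributes({2: 0.35, 10: 0.46, 20: 0.52, 40: 0.61})
```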

For H2 (attribute predictiveness), we can directly vary the number of attributes, so we can perform regression for each data set separately. However, attribute predictiveness is typically not a linear function of the number of attributes. Thus, for H2 we perform regression where the independent variable is the accuracy of CO for each trial (as a surrogate for attribute predictiveness).

We  do  not  directly  evaluate  H3  for  the  real  data  sets  (see  Section  7). 

For H4 and H5 (varying labeled proportion), we directly vary lp, so we can compute separate results for each data set. Moreover, lp is suitable for direct use as the independent variable. As with H1, we use the default number of attributes for each data set in order to avoid having high attribute predictiveness obscure the interaction of caution and lp. We omit nonsensical points (e.g., wvRN when lp=0%) from all of the analyses.

Finally, for each hypothesis we also perform a pooled analysis. For the synthetic data, this involves pooling the results of all the cautious CC algorithms, then performing the slope regression test. For the real-world data, we pool the results across both the CC algorithms and each of the real data sets. In addition, to account for differences in the data sets, we perform a multiple regression analysis that includes autocorrelation as one of the input variables (except for H1). In particular, we fit the data to the line Y = a + b_1 X_1 + b_2 X_2, where X_1 is the variable in question (e.g., lp for H4 or H5) and X_2 is the autocorrelation of the data set. The X_2 term factors out differences due only to autocorrelation, thus making the other trends more clear. The p-value corresponding to b_1 is then used for hypothesis testing.
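The two-variable fit can be sketched by solving the normal equations directly. This is a minimal illustration with hypothetical function names and invented data; a production analysis would normally use a statistics package, which also reports the p-value for b_1.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_two_predictors(x1, x2, y):
    """Least-squares estimates [a, b1, b2] for Y = a + b1*X1 + b2*X2."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Invented data generated exactly from Y = 1 + 2*X1 - 3*X2.
coeffs = fit_two_predictors([0, 1, 0, 1, 2], [0, 0, 1, 1, 1], [1, 3, -2, 0, 2])
```

Including X_2 in the fit lets b_1 capture the trend in the variable of interest after differences attributable to autocorrelation have been absorbed by b_2.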

6.10  Implementation  Validation 

To  validate  the  implementation  of  our  algorithms,  we  replicated  three  different  synthetic  data  gen¬ 
erators:  those  used  by  Sen  and  Getoor  (2006),  Neville  and  Jensen  (2007),  and  Sen  et  al.  (2008). 
We  then  replicated  some  of  the  experiments  from  these  papers.  While  several  of  our  CC  algorithm 
variants  were  not  evaluated  in  any  of  these  earlier  papers,  we  were  able  to  compare  results  for  ICA, 
Gibbs,  and  LBP,  with  the  LR  and  NB  classifiers  as  appropriate,  and  found  very  consistent  results. 
Section  8.4  discusses  one  exception. 




LBP  is  the  most  challenging  algorithm  to  implement  and  to  get  to  converge.  To  deal  with 
such  problems,  Sen  et  al.  (2008)  seeded  LBP’s  learning  process  with  weights  learned  from  ICA. 
Alternatively,  we  found  that  seeding  with  values  estimated  from  empirical  counts  over  the  data, 
combined  with  limiting  the  maximum  step  size  of  the  search  to  prevent  oscillation,  worked  well. 
With  these  enhancements,  LBP  achieved  equivalent  accuracy  to  that  reported  by  Sen  and  Getoor 
(2006),  and,  when  PLUL  was  applied,  significantly  improved  it  for  the  cases  of  high  homophily 
and  link  density  (where  LBP’s  accuracy  had  been  very  poor).  In  contrast,  we  found  that  LBP  could 
replicate  the  performance  of  Sen  et  al.  (2008),  but  that  in  this  case  PLUL  had  little  effect.  Section  8.4 
explains the data characteristics of that study (effectively high lp) that led to this result.

7.  Evaluation  Results 

This section presents our experimental results. Section 7.1 presents a summary of the results. Section 7.2 explains how we present the detailed results, and subsequent sections discuss these detailed results for each hypothesis. We focus on the sparse in-sample task, so we accept a hypothesis if it is confirmed, for the lp=10% case, by the pooled analysis on both the synthetic data and the real-world data. Hypotheses H4-H5 involve varying lp; here we accept the hypothesis if confirmed on both the synthetic and real data.

When a local classifier is needed, all results below use NB by default. We found that NB’s performance was better than or equivalent to that of LR and kNN in almost every case (see Section 8.4), for both the synthetic and real data sets, and that using LR or kNN led to very similar performance trends. Below we mention some of the results for LR and kNN; see the online appendix for more detail. In addition, PLUL is used everywhere unless otherwise specified; see analysis and motivation in Section 7.7.

7.1 Summary of Results

Tables 6-8 summarize our overall results for hypotheses H1-H5. Each table presents results for the synthetic data on the left and (where applicable) for the real data sets on the right. Each reported value represents the estimated slope of the line measuring the difference between a cautious and a non-cautious CC algorithm as the corresponding x-parameter (e.g., autocorrelation) is varied (see Section 6.9). Only values that were statistically different from zero are reported; otherwise a dash is shown. Bold values indicate a significant slope that supports the corresponding hypothesis. For instance, H2 predicted that caution becomes more important as attribute predictiveness decreases (a negative slope). Thus, Table 7 shows a minus sign for the expected slope and all significant negative slopes are shown in bold. Where possible, we show separate results for the out-of-sample, sparse in-sample, and dense in-sample tasks (using lp = 0%, 10%, and 50%). However, to simplify the table the real-world data results for H2 are shown only with lp=10%; Section 7.4 describes other results.

The tables show strong support for hypotheses H1, H2, and H4. In particular, we accept H1, H2, and H4 because the pooled analyses find significant slopes in the expected direction; non-pooled results also demonstrate consistent support. Thus, the data support the claims that each cautious inference algorithm outperforms⁷ its non-cautious variant by increasing amounts when autocorrelation

7. Technically, the slope results don’t by themselves show that the cautious algorithms “outperform” the non-cautious algorithms, only that the relative performance of the cautious algorithms is improving in the hypothesized direction.




Syn.  data  Real-world  data 

yX  P  P  P  P 

HI:  auto-correlation 

ICAc  vs.  ICA 

H- 

-hO.13  -hO.13  -hO.03 

— 

-hO.27 

-hO.15 

ICAkh  vs.  ICA 

H- 

n.a.  -h0.04  -h0.02 

n.a. 

-hO.15 

-hO.13 

Gibbs  vs.  Gibbsf^c 

H- 

-hO.18  -hO.15  -hO.03 

+0.27 

-hO.25 

-hO.15 

LBP  vs.  LBPmc 

H- 

H-O.IO  -hO.08  — 

— 

— 

— 

wvRNrl  vs.  wvRNica+nc 

H- 

n.a.  -hO.43  — 

n.a. 

-hO.41 

-hO.07 

WvRNjca+C  vs.  WvRNjca+NC 

+ 

n.a.  -h0.40  — 

n.a. 

-hO.67 

H-O.lO 

Pooled 

+ 

-hO.13  -hO.21  H-O.Ol 

-hO.13 

-hO.30 

H-O.ll 

Table 6: Summary of results for hypothesis H1. All values shown represent a slope that is significantly
different from zero; values in bold support H1. For H1, at a given lp value all data
sets (except WebKB+co) are used to compute a single slope value by treating the autocorrelation
of the data set as the X value. All algorithms used PLUL where applicable.
“n.a.” indicates that the algorithm doesn’t make sense at lp=0%.
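
The slope computation described in the caption can be sketched as an ordinary least-squares fit of cautious gain against autocorrelation. The sketch below is a minimal illustration, assuming one (autocorrelation, gain) pair per observation; the function name `ols_slope` and the example numbers are hypothetical and are not the paper's data or code. It returns the slope and its t statistic; turning the t statistic into a p-value requires a t-distribution with n-2 degrees of freedom, which is omitted here.

```python
import math

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs, plus the t statistic for slope != 0."""
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares; variance uses n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.inf
    return slope, t_stat

# Hypothetical illustration only (not values from the paper):
autocorr = [0.1, 0.3, 0.5, 0.7, 0.9]        # X value: autocorrelation
gain = [0.00, 0.02, 0.05, 0.08, 0.12]       # cautious minus non-cautious accuracy
slope, t_stat = ols_slope(autocorr, gain)
```

A positive slope in this toy example would correspond to the direction predicted by H1; whether it is significant depends on the t statistic and the number of observations.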


                              Expected         Syn. data           Real-world data (lp = 10%)
                               slope      lp=0%  lp=10%  lp=50%
H2: attribute predictiveness
  ICAc vs. ICA                   -        -0.10   -0.25   -0.12    -0.60   -0.61   -0.29       —       —       —
  ICAKn vs. ICA                  -         n.a.   -0.06   -0.08    -0.14       —   -0.16       —       —       —
  Gibbs vs. GibbsNC              -        -0.09   -0.27   -0.14    -0.44   -0.50       —       —       —       —
  LBP vs. LBPNC                  -        -0.12   -0.28   -0.05    -0.46   -0.35       —    n.c.   -0.29       —
  Pooled                         -        -0.10   -0.22   -0.10    -0.23 (over all real data and CC algs.)
H3: link density
  ICAc vs. ICA                   -        -0.08   -0.09   -0.03
  ICAKn vs. ICA                  -         n.a.   +0.06   -0.02
  Gibbs vs. GibbsNC              -        -0.09   -0.07   -0.04    (not evaluated)
  LBP vs. LBPNC                  -        +0.12   -0.23   -0.04
  wvRNRL vs. wvRNICA+NC          -         n.a.   -0.18   -0.05
  wvRNICA+C vs. wvRNICA+NC       -         n.a.    0.11   -0.03
  Pooled                         -            —   -0.07   -0.04

Table 7: Summary of results for hypotheses H2 and H3. As before, all values shown represent a
slope that is significantly different from zero; values in bold support the corresponding
hypothesis. All algorithms used PLUL where applicable. “n.c.” indicates where LBP did
not converge.


is higher (H1), attribute predictiveness is lower (H2), and/or the labeled proportion is lower (H4).
In addition, the data show consistent interactions among these factors. In particular, the strength of

However,  the  raw  accuracies  do  show  consistent  performance  gains  for  the  cautious  algorithms,  so  in  this  context  the 
slope  results  do  show  the  cautious  algorithms  outperforming  the  others  by  increasing  amounts. 




                              Expected    Syn.     Real-world data
                               slope      data
H4: labeled proportion (comparing cautious vs. non-cautious algorithm)
  ICAc vs. ICA                   -        -0.09    -0.11   -0.14   -0.05       —       —       —
  ICAKn vs. ICA                  -        -0.02    -0.06       —   -0.05   -0.29       —       —
  Gibbs vs. GibbsNC              -        -0.11    -0.14   -0.13    0.05    0.28       —       —
  LBP vs. LBPNC                  -        -0.05        —       —       —    n.c.       —       —
  wvRNRL vs. wvRNICA+NC          -        -0.28    -0.37   -0.39   -0.18       —       —       —
  wvRNICA+C vs. wvRNICA+NC       -        -0.27    -0.36   -0.32   -0.15   -0.31       —   +0.28
  Pooled                         -        -0.12    -0.07 (over all real data and CC algs.)
H5: labeled proportion (comparing with PLUL vs. without PLUL)
  ICAc                           -        -0.02        —       —       —       —       —       —
  ICAKn                          -        -0.01        —       —   -0.02       —       —       —
  ICA                            -            —        —       —       —   -0.18       —       —
  Gibbs                          -        -0.02        —       —   -0.04       —   -0.07       —
  LBP                            -        -0.03        —       —       —    n.c.       —       —
  Pooled                         -        -0.02    -0.01 (over all real data and CC algs.)

Table 8: Summary of results for hypotheses H4 and H5, which both vary the labeled proportion
(lp). As before, all values shown represent a slope that is significantly different from zero;
values in bold support the corresponding hypothesis. For H4, all algorithms used PLUL
where applicable.


the dependence (the magnitude of the slope) generally decreases as the labeled proportion increases
from 10% to 50% (Section 7.4 discusses the differences between lp=0% and 10% in more detail).

Table 7 shows weaker support for H3 (cautious inference gain increases as link density decreases).
H3 is supported by most of the synthetic data cases and by the pooled analysis for lp=10%
and lp=50%, but the magnitude of the slopes indicates a weaker effect. Moreover, Section 7.5 examines
these results more closely and proposes that a more appropriate hypothesis would state that the
cautious inference gain is greatest when link density is moderate. This conclusion is also tentatively
supported by a per-node degree analysis of the real data.

Table 8 also shows weaker support for H5. The synthetic data results supported H5 for every
algorithm except ICA. In addition, for 18 of the 29 possible cases shown for the real data sets,
the computed slope was negative, as predicted by H5. However, the magnitude of these slopes
indicates a weaker effect than with H1, H2, or H4. This decreased magnitude, in conjunction with
the smaller number of trials for the real data, leads to only 4 of those 18 slopes reaching statistical
significance. Nonetheless, by combining trials across algorithms and data sets, the pooled analysis
does find significant (but small) negative slopes for both the synthetic and real data, so we accept
H5. This indicates, as expected, that cautious learning with PLUL is most important when lp is
small; Section 7.7 also demonstrates that in this case PLUL can provide substantial performance
gains.




In addition to these results for each hypothesis, regarding relative performance trends as data
characteristics vary, our results also show statistically significant differences between the cautious
and non-cautious algorithms for at least some of the data conditions. These differences are consistent
with the accepted hypotheses. For instance, using the default synthetic data characteristics,
each cautious algorithm showed a significant performance gain over its non-cautious variant, and
the amount of this gain increased as autocorrelation increased, attribute predictiveness decreased,
or labeled proportion decreased.

7.2  Explanation  of  Results  Presentation 

In the following sections, we present several figures that compare CC algorithmic performance. In
these figures some controllable parameter is the x-axis and the y-axis is the resultant accuracy for a
given algorithmic variant, averaged over all trials. For instance, Figure 8 plots accuracy vs. the degree
of homophily (dh). Each figure compares cautious and non-cautious variants of a particular CC
algorithm: ICA, Gibbs, LBP, or wvRN. In addition, for the CC algorithms that use a local classifier
(ICA and Gibbs), we often include results for the non-relational algorithm CO for comparison.

In each section below, we use these results to describe two kinds of analysis. First, we accept
or reject a hypothesis, based on the pooled regression slope test. This analysis confirms or fails to
confirm that the importance of the cautious techniques does change in the expected direction as some
data parameter varies, but does not evaluate how important the cautious techniques are in improving
performance. To answer the latter question, we report on a second analysis that evaluates, using
paired t-tests, whether the cautious techniques perform significantly better than the non-cautious
alternatives (see Section 6.9).
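
The second kind of analysis can be illustrated with a paired t statistic computed over per-trial accuracies. This is a minimal sketch under stated assumptions: each trial yields one accuracy for the cautious variant and one for the non-cautious variant on the same data, the function name `paired_t_statistic` is hypothetical, and the accuracies shown are invented for illustration rather than taken from the experiments.

```python
import math

def paired_t_statistic(cautious_acc, noncautious_acc):
    """Return (mean difference, t statistic, degrees of freedom) for paired trials."""
    n = len(cautious_acc)
    if n < 2 or n != len(noncautious_acc):
        raise ValueError("need at least two paired trials")
    # Pairing: compare the two variants trial by trial, not their overall means.
    diffs = [c - nc for c, nc in zip(cautious_acc, noncautious_acc)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        t_stat = math.copysign(math.inf, mean_d) if mean_d != 0 else 0.0
    else:
        t_stat = mean_d / math.sqrt(var_d / n)
    return mean_d, t_stat, n - 1

# Hypothetical per-trial accuracies (illustration only):
cautious = [0.80, 0.82, 0.78, 0.85, 0.81]
noncautious = [0.74, 0.79, 0.75, 0.80, 0.76]
mean_gain, t_stat, dof = paired_t_statistic(cautious, noncautious)
```

The t statistic would then be compared against a t-distribution with `dof` degrees of freedom to decide significance; that lookup is left out of the sketch.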

Each figure has embedded statistical information corresponding to some of these t-tests. In
particular, each non-cautious CC variant is plotted with a × marker, while cautious CC variants are
plotted with a triangle (where multiple cautious variants exist, two triangle orientations are used: ▽
and △). For a particular x-value, if the plotted triangle is filled in (solid color), then that cautious
variant had accuracy that was significantly different from the accuracy of the corresponding non-cautious
variant. Hollow triangles instead indicate no significant difference. This notation does not
directly indicate other significance comparisons (e.g., between the two cautious variants ICAc and
ICAKn); where necessary we describe such results in the text. For example, in Figure 8, the graph in
the third column of the first row (LBP at lp=0%) shows that LBP significantly outperforms LBPNC
when dh=0.6 (note the filled triangle). However, for dh=0.5, LBP's small gain is not statistically
significant (hollow triangle).

When lp=0%, ICAKn is equivalent to ICA, so results for ICAKn are not shown. Also, LBP with
WebKB+co did not converge due to the very high number of links, so results for that case are not
considered (cf., Taskar et al., 2002).

7.3 Result 1: The Relative Gain of Cautious Inference Increases with Increasing
Autocorrelation

Table 6 reports that for H1, for the sparse in-sample task (lp=10%), the pooled regression analyses
found all significant positive values for the slope b. Thus, we accept H1. In addition, all the non-pooled
analyses found significant positive values. The only exception was LBP on the real data sets,
which had a positive, non-significant slope (b = +0.03).




[Figure 8 plots: one panel per CC algorithm (ICA, Gibbs, LBP, and wvRN) and per lp value; x-axis: degree of homophily (dh), from 0 to 1; y-axis: accuracy, from 0 to 1. The wvRN panels plot wvRNRL, wvRNICA+C, and wvRNICA+NC.]

Figure 8: Results for the synthetic data as the degree of homophily (dh) varies. Section 7.2 explains
how filled triangles indicate statistical significance. Some of the gains are small
but consistent, leading to significance, as in the bottom right graph.


For lp=0% and lp=50%, the pooled analyses and most individual analyses show the same positive
slopes (on the real data for ICAc vs. ICA at lp=0%, the slope was b = +0.11, but the p-value was
just over the significance threshold), as we also found with LR and kNN. The reduced significance
and magnitude of the slopes when lp=50% is also consistent with our expectations, since the overall
importance of caution should decrease as lp increases (see hypotheses H4 and H5). Section 7.4
explains more for the lp=0% case.

Figure 8 shows detailed performance trends for the synthetic data. Here each column presents
results for different variants of a single CC algorithm (ICA, Gibbs, LBP, and wvRN), and each row
shows results for a different value of lp. The x-axis varies homophily (which directly increases
autocorrelation) and the y-axis reports average accuracy.

This figure confirms that when homophily is very low, CC offers little gain, and thus the cautious
variants perform equivalently to the non-cautious variants (and, except for wvRN, to the non-




[Figure 9 plots: panels for lp=0% and lp=10%; x-axis: attribute predictiveness (ap), from 0 to 1; y-axis: accuracy, from 0 to 1. The wvRN panels plot wvRNRL, wvRNICA+C, and wvRNICA+NC.]

Figure 9: Results for the synthetic data as attribute predictiveness (ap) varies.


relational baseline CO). As the strength of relational influence (as well as the potential for incorrect
relational inference) increases with higher homophily, the relative gain of the cautious methods increases
substantially (e.g., at lp=10%, gains for NB-based algorithms rise from 4-5% at dh=0.5 to
9-12% at dh=0.9). The gains from caution are statistically significant in most cases when dh>0.3.
Results with LR and kNN show very similar trends (see online appendix).

Figure 8 also confirms that as lp increases, the cautious and non-cautious variants perform more
similarly. However, even for lp=50%, the cautious variants maintain a significant, though smaller,
advantage. In the other results discussed below, the same trend of very similar performances at
lp=50% was evident. Likewise, the graphs for lp=0% are similar to those for lp=10%. Thus, we
defer most results for lp=0% or 50% to the online appendix.

7.4  Result  2:  The  Relative  Gain  of  Cautious  Inference  Increases  as  Attribute  Predictiveness 
(ap)  Decreases 

Table 7 reports that, for lp=10%, the regression analyses found all significant negative slopes (as
expected) for the synthetic data. Likewise, in almost all cases we found significant negative slopes
for the real data sets that have substantial autocorrelation (Cora, Citeseer, HepTH, and WebKB+co),
except for WebKB+co (which had very erratic performance with all the algorithms). We accept H2,
because the pooled analysis found negative slopes at lp=10% for both synthetic and real data; this
result also holds at lp=0% and 50%.

Figure 9 shows detailed performance trends for the synthetic data as the x-axis varies ap. For
instance, for lp=10%, when the attribute predictiveness (ap) is 0.6 (the default), ICAc and Gibbs
outperform their non-cautious variants by 6-7%. However, as ap decreases to 0.2, label uncertainty




increases  (as  evidenced  by  the  drop  for  CO),  causing  the  relative  gain  of  caution  to  increase  to  20%. 
LBP  shows  very  similar  results. 

Results for lp=0% are mostly similar, but with an interesting twist. In this case, the relative
gain of cautious CC increases as ap decreases, as with lp=10%. However, this gain peaks at ap=0.2
or 0.3, then declines as ap continues to decrease. When attribute predictiveness is very low, and
there are no known labels to help seed the inference (i.e., lp=0%), then even the cautious algorithms
have difficulty exploiting relational information, and achieve accuracy only moderately above the
baseline CO. However, even in this case the cautious algorithms maintain some small, statistically
significant advantage over their non-cautious variants (which at ap=0.1 do little better than CO).
Also, observe that ICAc, Gibbs, and LBP all improve substantially for the lp=10% case (compared
to lp=0%), even though only ICAc explicitly favors the provided known labels in its inference
process. In this case, using caution appears to be the important performance factor, regardless of
what specific behavior provides that caution.

Figures 10 and 11 provide similar results for the real data sets with lp=10%, where the x-axis
is now the number of attributes used, which correlates with overall attribute predictiveness. In
general, the trends shown are similar to those already observed for the synthetic data. In particular,
the graphs for Cora, Citeseer, HepTH, and WebKB all follow the same pattern: cautious algorithms
outperform non-cautious algorithms more when the number of attributes is low (and cautious ICAc
outperforms the somewhat cautious ICAKn). Consistent with H1, the magnitude of these gains varies
with autocorrelation: larger for Cora and Citeseer, smaller for HepTH and WebKB, and non-existent
for Terror (where autocorrelation is very weak).

There are two exceptions to the similarities of these results with the synthetic data. First, for
some data sets Gibbs and/or LBP perform noticeably worse than ICAc; we discuss this separately in
Section 8.1. Second, WebKB+co shows fairly erratic performance for all algorithms except ICAKn.
In general, the co-citation links used by WebKB+co appear to be very informative (peak accuracy
is much higher than with WebKB), but also potentially misleading. This may be a function of the
WebKB graph structure: Table 5 shows that co-citation links have a very high label consistency
of 0.90 (implying that classifiers will learn a strong relational dependence), but this may be biased
by the presence of some very high degree nodes. During learning the co-citation links may appear
very informative on average, but this strong dependency may lead to mispredictions for low-degree
nodes, leading to the observed erratic behavior.

We now briefly return to the slope analysis of Table 7. For the synthetic data, the negative slopes
for H2 are significant in all cases, but generally largest for lp=10%. This behavior is consistent with
our previously discussed analyses of the synthetic data: when lp=0%, the performance of cautious
algorithms for very low ap is diminished, thus producing a smaller slope magnitude than when
lp=10%. On the other hand, the more general observation that caution is less useful when lp is high
explains why the magnitude of the slopes is less for lp=50% than for lp=10%. We found similar
trends for the real-world data sets: while Table 7 shows significant negative slopes for H2 for most
cases (excluding the erratic WebKB+co and the low autocorrelation Terror) when lp=10%, results
(not shown) with lp=0% or 50% indicate slopes of reduced magnitude and/or slopes that do not
reach statistical significance. However, in both cases the pooled analysis still indicates significant
negative slopes for H2 (-0.05 for lp=0% and -0.13 for lp=50%).




[Figure 10 plots: panels for Cora, Citeseer, HepTH, and Terror, all at lp=10%; x-axis: number of attributes (2, 5, 10, 20, 40, 60, 80, 100).]

Figure 10: Results for four of the real data sets as the number of attributes is varied. The x-axis
is not to scale; this is to improve readability and to yield a more linear curve for the
baseline CO algorithm, thus facilitating comparison with Figure 9. Because there are
only 3-5 trials for the real data, high variance sometimes causes substantial gains to not
be statistically significant.

7.5 Result 3: The More Cautious Algorithms Outperform Non-Cautious Algorithms when
Link Density (ld) is Moderate, But Have Mixed Results When ld is High

For the synthetic data, the results in Table 7 support H3 for all algorithms when lp=50%, for most
algorithms when lp=10%, and for only two algorithms when lp=0%. The pooled analysis finds,




[Figure 11 plots: panels for WebKB+co (direct and co links) and WebKB (only direct links), all at lp=10%; x-axis: number of attributes.]

Figure 11: Results for the WebKB data sets as the number of attributes is varied. With WebKB+co,
LBP did not converge, so results are not shown.


[Figure 12 plots: four panels; x-axis: link density (ld).]

Figure 12: Results for the synthetic data as link density (ld) varies.


as expected, significant negative slopes for lp=10% and lp=50%. However, without corresponding
pooled results for the real data, we cannot accept H3. Moreover, the results we present below will
suggest a revision to H3.

Figure 12 shows the results as ld is varied, for lp=10%. When ld is low to moderate (up to
ld=0.6), the cautious algorithms consistently and significantly outperform their non-cautious variants.
We had hypothesized that this advantage would decrease as link density increased, because
when the link graph is dense, the relational features are relatively unaffected by a few incorrect
labels, and thus using such labels cautiously matters less; Figure 12 generally reflects this trend. In
some cases the non-cautious algorithm even outperforms the cautious algorithm at very high ld. For
instance, at ld=0.9 ICA outperforms the more cautious ICAc (though not significantly). At such
high link density, simply using all available information with ICA may work better than ICAc’s




cautious but partial use of estimated labels, provided that accuracy is high enough that errors are
few. In separate experiments we confirmed that if the attribute predictiveness (and thus accuracy)
was lower, ICAc maintained its advantage over ICA even when ld was very high.
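
The contrast between the ICA variants discussed here can be sketched in terms of which labels each variant exposes to the relational features at a given iteration. This is an illustrative sketch only. It assumes a linear commitment schedule for ICAc (the most confident fraction h/n of estimated labels at iteration h of n) and assumes that ICAKn restricts itself to known labels in its first iteration and uses all estimates afterwards; the function `usable_labels` and its arguments are hypothetical names, not the authors' implementation.

```python
def usable_labels(variant, iteration, n_iters, known, estimated):
    """Return {node: label} visible to relational features at this iteration.

    known:     {node: label} for nodes whose true labels are given.
    estimated: {node: (label, confidence)} for current predictions.
    iteration: 1-based counter, with 1 <= iteration <= n_iters.
    """
    if variant == "ICA":
        # Non-cautious: every estimated label is used from the start.
        chosen = {v: lab for v, (lab, _) in estimated.items()}
    elif variant == "ICAKn":
        # Assumed behavior: known labels only in the first iteration,
        # then all estimated labels in later iterations.
        chosen = {}
        if iteration > 1:
            chosen = {v: lab for v, (lab, _) in estimated.items()}
    elif variant == "ICAc":
        # Assumed linear schedule: keep the most confident k estimates.
        k = len(estimated) * iteration // n_iters
        ranked = sorted(estimated.items(), key=lambda kv: kv[1][1], reverse=True)
        chosen = {v: lab for v, (lab, _) in ranked[:k]}
    else:
        raise ValueError(f"unknown variant: {variant}")
    # Known labels always take precedence over estimates.
    chosen.update(known)
    return chosen

# Hypothetical toy input:
known = {"a": "x"}
estimated = {"b": ("x", 0.9), "c": ("y", 0.6), "d": ("y", 0.3), "e": ("x", 0.8)}
first_step_cautious = usable_labels("ICAc", 1, 4, known, estimated)
```

Under these assumptions, ICA sees five labels in the first iteration, ICAKn sees only the known one, and ICAc sees the known label plus its single most confident estimate.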

While these results generally indicate, as expected, that the gain from caution decreases as ld
becomes high, closer examination indicates that this gain from caution peaks not at very low ld, but
at moderate ld. In particular, the gain from caution peaks when ld is 0.2 or 0.3 for ICAc, ICAKn,
or Gibbs, and when ld is 0.6 for wvRNRL and wvRNICA+C. In hindsight, this effect makes sense: as
the number of links decreases, there is less relational influence, and thus less probability of incorrect
relational influence, so caution matters less. Another effect is that with fewer links, there are fewer
opportunities for a cautious algorithm to favor one node’s predictions over another’s.

To further analyze these trends, we turn to the real data. We did not attempt to directly vary the
link density of the real data sets, because it’s not clear how to realistically add links to an existing
data set, as would be necessary to create a reasonable range of link densities for experimentation.
However, Table 9 examines our previous results for the real data sets, showing the amount of cautious
gain broken down by the link degree of each node. This approach does not directly correlate
to varying the overall link density, so our conclusions are tentative, but it does provide some insight.
We focus primarily on ICAc; trends with other algorithms were similar.

The  results  support  our  previous  conjectures.  In  particular,  the  cautious  gain  generally  decreases 
for  the  highest  link  degrees,  even  going  negative  in  some  cases.  Moreover,  in  most  cases  the  cautious 
gain  also  decreases  for  the  lowest  link  degrees,  resulting  in  a  peak  for  the  cautious  gain  (shown  in 
bold  if  present)  at  moderate  link  degrees.  These  effects  generally  hold  true  for  the  synthetic  data 
and  for  the  real  data  sets  that  have  substantial  autocorrelation. 

We now return to Figure 12 to consider a few possible exceptions. First, with LBP, accuracy
decreases with increasing ld, is erratic, and is sometimes better with LBPNC than with LBP. This is
not surprising: the short graph cycles caused by high ld produce great problems for LBP (e.g., Sen
and Getoor, 2006; Sen et al., 2008). Even these LBP accuracies are much better than those achieved
without PLUL (see Section 7.7).

Second, two of the cautious algorithms (ICAKn and wvRNICA+C) performed unexpectedly well,
continuing to significantly outperform the non-cautious variants (and even alternative cautious variants)
at very high link density. Interestingly, these effects also occur with WebKB+co (see Figures
11 and 14), which has by far the highest link density of the real data sets.^8 In addition, the
superior performance of ICAKn at high ld remains even when the local classifier is changed to LR
or kNN (see Figure 19 in the online appendix). We suspect that ICAKn’s advantage arises because it
both achieves a better starting point than ICA (by favoring known labels in its first iteration) and exploits
more information than ICAc (by using all estimated labels in subsequent iterations, and when
ld is high, using a few erroneous labels doesn’t harm performance). For wvRNICA+C, its advantage
over wvRNRL must arise from the key algorithmic difference: since wvRNICA+C is a hard-labeling
algorithm, it gives all labeled nodes equal weight in the neighborhood average that determines the
next label for a node. When link density is high, relying on this simple average may be better than
wvRNRL’s soft-labeling estimation, which implicitly gives more weight to nodes with more extreme

8. At first, these strong performances seem to conflict with Macskassy and Provost (2007), who generally find wvRNRL
outperforms wvRNICA+C. However, two-thirds of their data sets are variants of WebKB, but where all “Other”
pages have been removed from the classification task. This change makes the classification problem easier, and thus
may explain the discrepancy. In addition, on the only other data set used in that work and this article (Cora), our
performance trends are very similar.




                 Degree 1-2   Degree 3-5   Degree 6-10   Degree 11-20
Synthetic data, using NB+ICAc
  lp= 0%            5.5%         8.2%        13.8%           8.2%
  lp=10%            5.2%         9.7%         8.6%          10.6%
  lp=50%            2.2%         4.6%         8.9%           7.3%
  Average           4.3%         7.5%        10.4%           8.7%
Synthetic data, using NB+Gibbs
  lp= 0%            8.5%        13.6%        18.6%          15.4%
  lp=10%            6.3%         9.7%        12.7%          10.6%
  lp=50%            2.5%         3.6%         7.4%           5.3%
  Average           5.7%         9.0%        12.9%          10.5%
Real data with substantial autocorrelation, using NB+ICAc
  Cora              7.9%        10.9%        10.5%          -4.8%
  Citeseer         15.8%        20.5%        14.6%          -8.3%
  WebKB+co          8.3%         9.8%        12.8%          10.0%
  HepTH             1.2%        -4.3%         3.3%           4.0%
  Average           8.3%         9.2%        10.3%           0.2%
Other real data sets, using NB+ICAc
  WebKB             5.8%         1.5%        -4.8%          -6.0%
  Terror            2.4%        -5.7%         0.0%           0.0%

Table 9: Per-node degree results showing the amount of gain from caution (ICAc vs. ICA or Gibbs
vs. GibbsNC). Each value indicates the average accuracy gain from caution for all nodes
in the test set within the given link degree range (nodes with degree greater than 20 were
rare, and ignored for simplicity). Within each row, a value is in bold if it represents a clear
peak, with monotonically decreasing accuracies to both the left and right of that value.
The synthetic data used the default settings. The real data sets used the default number of
attributes and lp=10%.
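
The "clear peak" criterion in the caption can be expressed as a small check on each row. The sketch below is one possible reading, assuming that a clear peak must be an interior value with strictly decreasing gains on both sides; the function name `clear_peak` is hypothetical, and the rows used in the example are copied from Table 9 purely to exercise the function.

```python
def clear_peak(values):
    """Return the index of an interior clear peak, or None if there is none.

    Assumed reading of the caption: the maximum must not sit at either end,
    and the values must strictly decrease moving away from it on both sides.
    """
    best = max(range(len(values)), key=values.__getitem__)
    if best == 0 or best == len(values) - 1:
        return None
    left = values[: best + 1]     # should rise strictly up to the peak
    right = values[best:]         # should fall strictly after the peak
    rising = all(a < b for a, b in zip(left, left[1:]))
    falling = all(a > b for a, b in zip(right, right[1:]))
    return best if rising and falling else None

# Rows from Table 9 (gain in percent for degree 1-2, 3-5, 6-10, 11-20):
synthetic_lp0 = [5.5, 8.2, 13.8, 8.2]     # peak at index 2 (degree 6-10)
synthetic_lp10 = [5.2, 9.7, 8.6, 10.6]    # maximum at the right edge: no clear peak
cora = [7.9, 10.9, 10.5, -4.8]            # peak at index 1 (degree 3-5)
```

If endpoint maxima were also meant to count as peaks, the early return for `best == 0` or the last index would need to be relaxed.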


estimated distributions. In both cases, however, extending ld to even more extreme values (e.g.,
ld=0.95) does confirm the overall trend of the amount of cautious gain decreasing at high ld.
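
The difference between the two neighborhood averages can be illustrated as follows. This is a simplified sketch, assuming unweighted links and a fixed set of class names shared by all neighbors; `soft_vote` and `hard_vote` are hypothetical helper names, and the code is not the wvRN implementation used in the experiments. With soft labeling each neighbor contributes its full estimated distribution, so a neighbor with an extreme estimate counts for more; with hard labeling each neighbor contributes one equal vote for its most likely class.

```python
def soft_vote(neighbor_dists):
    """Average the neighbors' estimated class distributions (soft labeling)."""
    n = len(neighbor_dists)
    classes = list(neighbor_dists[0].keys())
    return {c: sum(d[c] for d in neighbor_dists) / n for c in classes}

def hard_vote(neighbor_dists):
    """Average one-hot votes for each neighbor's most likely class (hard labeling)."""
    n = len(neighbor_dists)
    classes = list(neighbor_dists[0].keys())
    counts = {c: 0 for c in classes}
    for d in neighbor_dists:
        counts[max(classes, key=lambda c: d[c])] += 1
    return {c: counts[c] / n for c in classes}

# Hypothetical neighborhood: one very confident "A" neighbor, two lukewarm "B" neighbors.
neighbors = [
    {"A": 0.95, "B": 0.05},
    {"A": 0.45, "B": 0.55},
    {"A": 0.45, "B": 0.55},
]
soft = soft_vote(neighbors)   # the extreme neighbor dominates: "A" scores highest
hard = hard_vote(neighbors)   # equal votes: "B" wins two to one
```

In this toy case the two schemes disagree, which illustrates how soft labeling implicitly weights neighbors by the extremity of their estimates.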

As expected, we found that these performance differences disappeared when many known labels
were provided. In particular, at high link density and lp=50%, there were only small differences
between ICAc, ICAKn, and ICA, or between wvRNRL, wvRNICA+C, and wvRNICA+NC. In addition,
when PLUL was used, even LBP and LBPNC performed on par with ICAc and Gibbs when lp=50%,
despite the challenges of LBP with high ld.

Overall, our results suggest that a more appropriate rendering of H3 should indicate that the
relative gain from caution will peak at some moderate value of ld, with the precise value depending
on the CC algorithm and the other data conditions. We leave confirmation of this revised hypothesis
to future work.

7.6 Result 4: The Relative Gain of Cautious Inference Increases as the Labeled Proportion
(lp) Decreases

Table 8 reports that, as lp varies, the regression analyses found all significant negative slopes (as
expected) for the synthetic data. Likewise, in almost all cases we found significant negative slopes




[Figure 13 plots: four panels; x-axis: labeled proportion (lp).]

Figure 13: Results for the synthetic data as the labeled proportion (lp) varies.


for  the  real  data  sets  with  substantial  autocorrelation  (all  except  Terror  and  WebKB).  We  accept  H4, 
because  the  pooled  analyses  find  all  negative,  significant  slopes. 

For the real data, the exceptions to H4’s stated trend were primarily WebKB+co, which had
very erratic performance with all the algorithms, and WebKB, where none of the slopes attained
statistical significance. In addition, LBP had highly variable behavior so that only for Citeseer did
the slope approach statistical significance (p = .053, just over the threshold).

Figure 13, for the synthetic data, shows the performance of the cautious and non-cautious algorithms
converging as lp increases. The cautious algorithms maintain a significant advantage until
lp=80%. Observe that ICAKn’s curve lies between that of the more cautious ICAc and the non-cautious
ICA, while wvRNRL and wvRNICA+C obtain the same results with their two different approaches
to caution.

Figure 14 shows results for the real data sets as lp is varied. This figure shows results only
for wvRN, since results were previously presented for the other algorithms for varying numbers of
attributes, and the lp graphs don’t add additional insight for those algorithms.

The results in Figure 14 mirror those of the synthetic data, with a few exceptions. First,
wvRNica+c does poorly on Terror, perhaps because of the low autocorrelation. Second, with
WebKB+co, wvRNica+c outperforms wvRNrl when lp is low, though the gains are not quite significant;
this effect was discussed in Section 7.5. Finally, the accuracy of wvRN for WebKB goes down with
increasing lp. WebKB with just direct links has some autocorrelation but very low label consistency
(see Table 5), because each node tends to link in certain patterns to nodes with a different label
from itself (cf. Macskassy and Provost, 2007). Algorithms based on wvRN assume homophily, not
such more complex forms of autocorrelation. Consequently, increasing lp only serves to reduce
accuracy below the majority class baseline. Running wvRN with only co-citation links, as done for
WebKB+co, works much better.
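The dependence of wvRN on homophily can be illustrated with a minimal sketch of a weighted neighbor
vote. This is not the implementation evaluated in this article: the function name, the toy graphs,
and the unit link weights are all hypothetical, and the sketch shows only a single hard vote rather
than any of the iterative wvRN variants. It is meant only to show that a neighbor vote recovers the
true label when links connect like-labeled nodes, and is misled when a node links mostly to nodes of
a different label.

```python
from collections import defaultdict

def wvrn_vote(node, neighbors, labels):
    """Return the label with the largest weighted vote among labeled neighbors.

    neighbors: dict mapping node -> list of (neighbor, weight) pairs (hypothetical format)
    labels:    dict mapping node -> known or currently estimated label
    """
    votes = defaultdict(float)
    for nbr, weight in neighbors[node]:
        if nbr in labels:              # unlabeled neighbors contribute nothing
            votes[labels[nbr]] += weight
    if not votes:
        return None                    # no labeled neighbor, so no relational evidence
    return max(votes, key=votes.get)

# Toy homophilous case: node "x" (true label "A") links mostly to "A" nodes.
homophilous = {"x": [("a1", 1.0), ("a2", 1.0), ("b1", 1.0)]}
labels_h = {"a1": "A", "a2": "A", "b1": "B"}

# Toy non-homophilous case: node "x" (true label "A") links mostly to "B" nodes.
heterophilous = {"x": [("b1", 1.0), ("b2", 1.0), ("a1", 1.0)]}
labels_n = {"b1": "B", "b2": "B", "a1": "A"}

print(wvrn_vote("x", homophilous, labels_h))    # the vote agrees with the true label "A"
print(wvrn_vote("x", heterophilous, labels_n))  # the vote picks "B", the wrong label
```

In the second case, knowing more neighbor labels only makes the wrong vote more decisive, which is
consistent with the decline in accuracy observed for WebKB with direct links as lp increases.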

7.7 Result 5: The Relative Gain of Cautious Learning With PLUL Increases as the Labeled
Proportion (lp) Decreases

The  previous  results  compared  cautious  vs.  non-cautious  variants  of  a  particular  CC  algorithm,  in 
all  cases  using  PLUL.  We  now  justify  the  use  of  PLUL  and  examine  its  impact. 

The bottom of Table 8 shows the regression slope results for H5, where the x-axis varies the
labeled proportion (lp), and each table row compares a single CC algorithmic variant when using
PLUL vs. not using PLUL. As expected, the slope analysis found all significant negative slopes for
the synthetic data (with one exception where the p-value was close to the threshold), although the
magnitude of the slopes suggests a weak trend. For the real data sets, while 18 of the 29 possible
slopes were in the expected direction, only 4 of these slopes were statistically significant (recall that
the real data sets have available only 3-5 trials, making significance harder to achieve). However,
pooling the results across the data sets and algorithms yields a significant negative slope for both
the synthetic and real data, so we accept H5.

Figure 14: Results for variants of wvRN on the real data sets, as lp is varied. For the first WebKB
results, wvRN uses only co-citation links (unlike previous results with other algorithms,
where WebKB+co used direct links and co-links together; see Section 6.6.3). Recall
that filled triangles indicate statistical significance, but only for comparing the cautious
variant (here, wvRNrl or wvRNica+c) vs. the non-cautious variant (wvRNica+nc).
[Panel titles include Cora, Citeseer, HepTH, WebKB (only co-links), and WebKB (only
direct links); the x-axis of each panel is the labeled proportion (lp), with tick marks at
20%, 40%, 60%, and 80%.]

Thus, while the effect (as lp varies) is smaller than with previous hypotheses, H5 indicates that
PLUL provides the most gain when lp is small. To measure the magnitude of this gain, Table 10
shows the impact of PLUL when lp=0%. Each row shows the results for a different collective
algorithm. Results are given for each algorithm both with and without PLUL, along with the overall
gain from PLUL. Because PLUL interacts closely with the local classifier, we show results here for
NB, LR, and kNN for the CC algorithms that use a local classifier. CO and wvRN are unaffected by
PLUL, and thus are not shown.

In general, we found that PLUL improved performance, sometimes substantially, but the data
regions where such substantial gains occur vary by classifier and/or CC algorithm. For instance,
Column A of Table 10 shows results for the default synthetic data settings. Here, PLUL improves
performance for almost all algorithms. In particular, the gains range from -0.3% to 10.8%, with an
average of 4.0%, and are significant in 9 of the 14 cases. Column B shows results where the attribute
predictiveness is 0.3 (instead of the default 0.6). In this case, the gains due to PLUL are almost
all larger, ranging from 0.7% to 27.8% (average of 8.6%), and are significant in 11 of 14 cases.
These results are consistent with H2: when attributes are less predictive of the class label, cautious
techniques, including PLUL, become more important. Finally, Column C shows results where the
link density is now 0.7 (instead of the default 0.2); here the gains due to PLUL are more varied.
For ICAc, PLUL remains important and matters even more than with the default data settings. We
conjecture that this is because with so many links, relational influence can spread very quickly in
the graph, and thus the PLUL process is very important to ensuring that ICAc’s confidence measure
selects the most reliable predictions during the first few iterations. Indeed, when lp is instead set
to 10% (thus providing more certain estimates for the early iterations), PLUL became much less
important for ICAc. LBP has known issues with high link density, but PLUL helps substantially to
ameliorate them. For the other algorithms, the increased link density leads to PLUL having a minor
impact, consistent with H3.

             A.) Default settings    B.) Low attr. predictiveness    C.) High link density
             With PLUL?     PLUL     With PLUL?     PLUL             With PLUL?     PLUL
             Yes    No      Gain     Yes    No      Gain             Yes    No      Gain
Using the NB local classifier
  ICAc       78.9   77.8     1.1     58.2   52.5     5.7             80.6   72.0     8.6
  ICA        72.3   72.6    -0.3     47.7   46.8     0.9             77.0   75.4     1.6
  Gibbs      81.8   81.5     0.3     60.6   55.9     4.7             80.8   79.2     1.6
  GibbsNC    71.8   71.1     0.7     46.7   46.0     0.7             76.4   74.2     2.2
Using the LR local classifier
  ICAc       78.6   74.1     4.5     56.8   43.2    13.6             82.9   73.9     9.0
  ICA        70.8   68.5     2.3     48.5   44.4     4.1             70.8   72.5    -1.7
  Gibbs      76.5   72.9     3.6     52.3   50.8     1.5             77.6   77.8    -0.2
  GibbsNC    70.3   65.3     5.0     48.4   43.2     5.2             71.4   71.3     0.1
Using the kNN local classifier
  ICAc       74.1   69.0     5.1     51.4   39.2    12.2             78.5   65.4    13.1
  ICA        71.7   64.2     7.5     48.4   41.0     7.4             75.2   74.7     0.5
  Gibbs      73.9   70.0     3.9     54.4   48.1     6.3             80.3   79.7     0.6
  GibbsNC    71.7   61.3    10.4     47.7   38.9     8.8             75.0   74.0     1.0
Using LBP
  LBP        77.8   76.4     1.4     55.7   27.9    27.8             69.7   21.5    48.2
  LBPnc      73.9   63.1    10.8     45.5   24.4    21.1             54.3   31.2    23.1

Table 10: Impact of PLUL on accuracy with the synthetic data, for CC algorithms where PLUL
applies, at lp=0%. Gains in bold are statistically significant.
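The summary statistics quoted for Column A of Table 10 (gains from -0.3% to 10.8%, averaging 4.0%)
can be re-derived directly from the "with PLUL" and "without PLUL" accuracies in that column. The
short check below does only this arithmetic; the dictionary layout and variable names are ours and
are not part of the original analysis, and significance is not recomputed.

```python
# (accuracy with PLUL, accuracy without PLUL) from Column A of Table 10.
column_a = {
    ("NB", "ICAc"): (78.9, 77.8),
    ("NB", "ICA"): (72.3, 72.6),
    ("NB", "Gibbs"): (81.8, 81.5),
    ("NB", "GibbsNC"): (71.8, 71.1),
    ("LR", "ICAc"): (78.6, 74.1),
    ("LR", "ICA"): (70.8, 68.5),
    ("LR", "Gibbs"): (76.5, 72.9),
    ("LR", "GibbsNC"): (70.3, 65.3),
    ("kNN", "ICAc"): (74.1, 69.0),
    ("kNN", "ICA"): (71.7, 64.2),
    ("kNN", "Gibbs"): (73.9, 70.0),
    ("kNN", "GibbsNC"): (71.7, 61.3),
    ("LBP", "LBP"): (77.8, 76.4),
    ("LBP", "LBPnc"): (73.9, 63.1),
}

# Gain from PLUL for each (classifier, algorithm) pair, rounded as in the table.
gains = {key: round(with_plul - without, 1)
         for key, (with_plul, without) in column_a.items()}
mean_gain = sum(gains.values()) / len(gains)

print(min(gains.values()), max(gains.values()), round(mean_gain, 1))
# prints: -0.3 10.8 4.0
```

The same computation applied to Column B gives a range of 0.7% to 27.8% and an average of 8.6%,
matching the values reported above.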

Table 11 shows similar results for the real data sets, where results for all six data sets have
been pooled together. Since we cannot directly vary link density, we instead show results with two
conditions. On the left is the “fewer attributes” case; here each data set uses its default number
of attributes, as explained in Section 6.9. On the right is the case where each data set uses 100
attributes.

Compared to results with the synthetic data, Table 11 shows less evidence for the effectiveness
of PLUL with the real data sets. While all algorithms show a gain from using PLUL, only about
half of the gains are statistically significant. To explain, consider that PLUL works best when the
holdout set used for learning is most similar to the test set. With the synthetic data, such similarity
is likely, because the two graphs are generated from the same distribution. However, with the real
data, splitting an arbitrary graph into multiple subgraphs, even while seeking to maintain similar
class distributions, may nonetheless produce subgraphs with important differences (e.g., in
autocorrelation), leading to sub-optimal parameter choices by PLUL. Future work is needed to explore
these issues.

             Fewer attributes (default)    More attributes (100)
             With PLUL?     PLUL           With PLUL?     PLUL
             Yes    No      Gain           Yes    No      Gain
  ICAc       56.8   56.1     0.7           68.6   68.1     0.5
  ICA        54.5   52.3     2.2           65.7   64.9     0.8
  Gibbs      53.5   50.1     3.4           67.0   66.1     0.9
  GibbsNC    55.5   53.0     2.5           66.5   65.6     0.9
  LBP        49.9   44.3     5.6           65.2   58.4     6.8
  LBPnc      46.0   42.1     3.9           63.5   56.4     7.1

Table 11: Accuracy results showing the impact of using PLUL with the real data. Each value shows
results pooled over the six real data sets, at lp=0%, using NB where applicable. Gains in
bold are statistically significant.

Nonetheless,  the  evidence  suggests  that  in  most  cases  for  the  real  and  synthetic  data  PLUL 
improves  performance.  Moreover,  for  every  algorithm  there  was  some  type  of  data  for  which  not 
using  PLUL  led  to  very  poor  performance.  Thus,  applying  PLUL  in  all  of  our  other  experiments 
seemed  advisable  for  maximizing  performance  and  for  ensuring  the  most  equitable  comparisons. 

7.8  Choice  of  Relational  Feature  Types 

Section 6.6 described how each trial selected a type of relational feature to use. For completeness,
Table 12 summarizes how often each type of feature was chosen. In general, the best feature type (as
chosen by cross-validation) varied based on the local classifier used and the data conditions.
However, Table 12 shows that for NB, multiset features were dominant, especially for the more cautious
algorithms (chosen 76-96% of the time for ICAc and Gibbs). With kNN, proportion features were
dominant, while with LR count features were chosen most often but proportion features were also
fairly common, especially with high ld. These results suggest that an analyst should most likely use
multiset with NB, use proportion with kNN, and consider the data conditions to select a feature type
for LR.
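To make the three feature types concrete, the following is a minimal sketch of how count,
proportion, and multiset features might be computed from the labels of a node's neighbors. The
precise definitions used in our experiments are those of Section 6.6.1; the function below, its
name, and its treatment of unlabeled neighbors (simply dropped for all three types, marked here by
None) are simplifying assumptions made for illustration only.

```python
from collections import Counter

def relational_features(neighbor_labels, classes):
    """Compute count, proportion, and multiset features for one node.

    neighbor_labels: labels of linked nodes; None marks a neighbor with no
                     known or predicted label (assumed here to be ignored).
    classes:         ordered list of class labels.
    """
    usable = [lab for lab in neighbor_labels if lab is not None]
    counts = Counter(usable)
    total = len(usable)

    # Count feature: number of labeled neighbors in each class.
    count_feat = [counts.get(c, 0) for c in classes]
    # Proportion feature: the counts normalized by the number of labeled neighbors.
    prop_feat = [counts.get(c, 0) / total if total else 0.0 for c in classes]
    # Multiset feature: the bag of neighbor labels itself (order is irrelevant).
    multiset_feat = sorted(usable)
    return count_feat, prop_feat, multiset_feat

count_feat, prop_feat, multiset_feat = relational_features(["A", "B", "A", None], ["A", "B"])
print(count_feat)      # [2, 1]
print(multiset_feat)   # ['A', 'A', 'B']
```

Note that the count feature grows with a node's degree, whereas the proportion feature does not,
which is one plausible reason the preferred type can shift with link density.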

The superiority of multiset features, when they were applicable, is interesting because they are
“cautious” features that simply ignore nodes with no known or predicted label (see Section 6.6.1).
Likewise, Section 6.5 reported that LR with ICAc performed best when missing feature values
were ignored (by using a separate classifier trained without the missing features). These results
are consistent with Saar-Tsechansky and Provost (2007), who found (for non-relational data) this
“reduced-feature model” approach to be superior to commonly used approaches based on imputation.
For a non-relational setting, their results thus demonstrate the superiority of a more “cautious”
approach to handling missing values during testing. For relational domains, we could imagine taking
this idea of ignoring missing/estimated values even further, e.g., using a classifier that ignored
the estimated label of a linked node but instead directly used its non-relational features. However,
Jensen et al. (2004) demonstrated that such an approach is generally inferior to the approaches we
consider in this article (label-based features with collective inference), because of the much larger
number of model parameters that must be learned for the former case.

               A.) ICAc                  B.) ICA                   C.) Gibbs
               Mult.   Count   Prop.     Mult.   Count   Prop.     Mult.   Count   Prop.
Synthetic data, using the NB local classifier
  Default      96%     0%      4%        72%     0%      28%       100%    0%      0%
  Low ap       88%     0%      12%       20%     4%      76%       92%     0%      8%
  High ld      48%     0%      52%       80%     0%      20%       96%     0%      4%
  Average      77%     0%      23%       57%     1%      41%       96%     0%      4%
Synthetic data, using the LR local classifier
  Default      n.a.    92%     8%        n.a.    80%     20%       n.a.    80%     20%
  Low ap       n.a.    52%     48%       n.a.    60%     40%       n.a.    68%     32%
  High ld      n.a.    80%     20%       n.a.    52%     48%       n.a.    48%     52%
  Average      n.a.    75%     25%       n.a.    64%     36%       n.a.    65%     35%
Synthetic data, using the kNN local classifier
  Default      n.a.    0%      100%      n.a.    0%      100%      n.a.    0%      100%
  Low ap       n.a.    0%      100%      n.a.    12%     88%       n.a.    0%      100%
  High ld      n.a.    0%      100%      n.a.    0%      100%      n.a.    0%      100%
  Average      n.a.    0%      100%      n.a.    4%      96%       n.a.    0%      100%
Real data, using the NB local classifier
  Cora         97.5%   2.5%    0.0%      70.0%   17.5%   12.5%     100.0%  0.0%    0.0%
  Citeseer     92.5%   2.5%    5.0%      57.5%   32.5%   10.0%     100.0%  0.0%    0.0%
  WebKB+co     84.4%   0.0%    15.6%     65.6%   34.4%   0.0%      71.9%   12.5%   15.6%
  WebKB        53.1%   40.6%   6.3%      31.3%   56.3%   12.5%     75.0%   21.9%   3.1%
  HepTH        85.0%   12.5%   2.5%      62.5%   25.0%   12.5%     70.0%   27.5%   2.5%
  Terror       50.0%   8.3%    41.7%     50.0%   25.0%   25.0%     41.7%   16.7%   41.7%
  Average      77.1%   11.1%   11.8%     56.1%   31.8%   12.1%     76.4%   13.1%   10.5%

Table 12: The relational feature type (multiset, count, or proportion) chosen by cross-validation.
For the synthetic data, results are shown with the default settings, with low attribute
predictiveness (ap=0.3), and with high link density (ld=0.7). For the real data, results are
shown averaged across all the data points shown in Figures 10 and 11.

7.9  Variants  of  wvRN 

Most prior research involving wvRN has used wvRNrl, the variant suggested as a relational-only
baseline by Macskassy and Provost (2007). However, algorithms based on wvRN need not necessarily
be relational-only. For instance, Macskassy (2007) described a technique for adding additional
links to the graph between nodes that appeared similar based on their attributes. Alternatively, we
could imagine, for wvRNrl, initializing each node’s predicted label probabilities based upon the
output of an attribute-only local classifier (instead of using class priors as done in Figure 5).
Unfortunately, this idea does not work well for a “soft” algorithm such as wvRNrl, because after
iterating many times the current state is almost completely determined by the known labels,
independent of the starting state (Macskassy and Provost, 2005). While in principle this problem
could be addressed via learning an appropriate decay parameter Γ and stopping point, this forfeits
much of the simplicity of wvRN.

Figure 15: Results for the synthetic data where wvRNseed is added for comparison. Because of the
multiple possible comparisons, filled triangles are not used here to indicate statistical
significance. [Nine panels: three with the degree of homophily (dh) on the x-axis, three
with the attribute predictiveness (ap), and three with the link density (ld).]

In contrast to wvRNrl, with a hard-labeling algorithm such as wvRNica+c, the initial conditions
do matter. In particular, we evaluated wvRNseed, an algorithm that behaves just like wvRNica+c,
except that each node’s predicted label is initialized to the most likely label predicted by an
attribute-only NB classifier. Non-relational information thus “seeds” the inference process but is
then not explicitly used again. To the best of our knowledge, this algorithm has not been previously
considered for CC.
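The seeding idea can be sketched as follows. This is a simplified illustration under stated
assumptions, not the wvRNseed implementation that we evaluated: links are unweighted, updates are
synchronous, ties keep the current label, every prediction is used at every iteration, and the seed
labels are simply given (standing in for the output of an attribute-only classifier). The function
name and the toy graph are hypothetical.

```python
def seeded_vote_inference(neighbors, known, seed, n_iters=10):
    """Hard-label neighbor voting, started from attribute-based seed labels.

    neighbors: dict node -> list of linked nodes (unweighted, assumed symmetric)
    known:     dict node -> true label; these labels are never changed
    seed:      dict node -> initial label for each node whose label is unknown
    """
    current = {**seed, **known}                 # known labels override any seed
    unknown = [n for n in seed if n not in known]
    for _ in range(n_iters):
        updated = dict(current)
        for node in unknown:
            votes = {}
            for nbr in neighbors.get(node, []):
                if nbr in current:
                    votes[current[nbr]] = votes.get(current[nbr], 0) + 1
            if votes:
                top = max(votes.values())
                winners = sorted(lab for lab, v in votes.items() if v == top)
                if current[node] not in winners:  # on a tie, keep the current label
                    updated[node] = winners[0]
        if updated == current:                  # converged: no label changed
            break
        current = updated
    return current

# Toy graph: one known node a0 and a triangle of unknown nodes u1, u2, u3.
neighbors = {"a0": ["u1"], "u1": ["a0", "u2", "u3"], "u2": ["u1", "u3"], "u3": ["u1", "u2"]}
known = {"a0": "A"}
seed = {"u1": "A", "u2": "A", "u3": "B"}        # u3's seed disagrees with its neighbors
result = seeded_vote_inference(neighbors, known, seed)
print(result)
```

In this toy example the seed label of u3 is overturned by its neighbors after one iteration, and
the attributes that produced the seeds play no further role, which mirrors the description above.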

Figure 15 shows a variety of results for the synthetic data; results with the real data showed
similar trends. Overall, wvRNseed outperforms wvRNrl (especially when lp is low), which is to be
expected since wvRNseed uses more information. wvRNseed generally underperforms ICAc, which
is also to be expected since ICAc both uses predicted labels cautiously (while wvRNseed treats all
predictions equally) and continues to use both attribute and relational information after the first
iteration. The differences with ICAc are largest when dh is low (where wvRN’s homophily assumption
is violated) or when attribute predictiveness is high (since wvRNseed uses the attributes only at
initialization). However, wvRNseed outperforms all of the other shown algorithms when link density is
high. This case is analogous to the results with ICAkn from Section 7.5: if accuracy and link density
are high (and homophily is present), then caution with relational information may not be necessary,
and this case shows that continuing to use non-relational information after initialization may also
not be necessary. Overall, the results indicate that wvRNseed is not likely to be a strong contender
as a general purpose CC algorithm, but they do demonstrate an effective way to add non-relational
information to wvRN-based algorithms.

7.10  Impact  of  the  Default  Values  for  Synthetic  Data  Generation 

The  synthetic  data  evaluated  above  was  generated  with  the  default  parameters  described  in  Table  4. 
Conceivably,  our  choice  of  default  values  could  have  an  important  effect  on  the  results.  While  our 
evaluation  of  multiple  real  data  sets  has  already  helped  to  validate  the  synthetic  data  results,  we  also 
carried out an extensive exploration with other default values. For instance, when varying ld, we
experimented with all combinations of ap={0.4, 0.6, 0.8}, dh={0.5, 0.7, 0.9}, and lp={0%, 10%,
50%}. For tractability, we only evaluated variants of ICA, since the above results show that ICAc
produced the best or nearly the best results for all synthetic and real data sets, and that other cautious
algorithms usually behaved like ICAc.

The trends were highly consistent with the results we report and agree with our accepted
hypotheses. For instance, if the default ap is very high, the results for varying dh showed a much
smaller slope for the relative impact of cautious ICAc vs. ICA. The only default value that
noticeably changed any result was already reported in Section 7.5: when ap was small (e.g., 0.4), the
unusual advantage of ICA over ICAc observed at very high ld disappeared. Thus, we believe the
trends in our results are robust over a wide range of data characteristics.

8.  Discussion 

In  this  section  we  compare  results  with  different  families  of  algorithms,  examine  the  overall  effec¬ 
tiveness  of  caution,  and  use  our  results  to  explain  the  findings  of  some  previous  research. 

8.1  Comparisons  Across  Algorithmic  Families 

Section 7 focused on comparing cautious vs. non-cautious variants within the same algorithmic
family. We now briefly compare across these families. We focus on the algorithms that have been
most frequently used in previous work: ICA, Gibbs, LBP, and wvRNrl. We also include the less
studied ICAc, since our results show that it has very strong performance. We report specific results
for lp=10%; comparisons were similar for lp=0%, while all of the algorithms perform very similarly
when lp=50%.

wvRNrl’s performance depends on homophily, link density, and lp. In our study, wvRNrl was
thus competitive with the other CC algorithms when homophily and/or lp was high, or when the
attributes were not very predictive. On the other hand, wvRNrl requires that some labels are known
in the test set, so it is not applicable when lp=0% (the out-of-sample task). wvRNseed would be an
alternative.

For the synthetic data, the cautious algorithms ICAc, Gibbs, and LBP had remarkably similar
performance. Among the three, Gibbs had a small but sometimes significant performance advantage.
For instance, across the results for varying dh at lp=10% shown in Figure 8, Gibbs outperformed
ICAc by an average of 1.0% (significantly for dh>0.6) and LBP by an average of 0.7%
(significantly for 0.4<dh<0.7). Neither ICAc nor LBP had consistent, significant gains over the
other, except that both Gibbs and ICAc had substantial, significant gains over LBP when attribute
strength was very low (gains of 5-8%) or when link density was high (gains of 14-25%). However,
all three algorithms did have substantial, significant gains vs. ICA, except for when dh was very low
or when ld was very high. For instance, across the various dh levels, Gibbs outperformed ICA by
0.9-11.2% (all significantly) except for a loss of 0.1% at dh=0.1. Thus, based on the synthetic data
results, ICAc, Gibbs, and LBP usually achieve similar accuracies, despite their use of very different
approaches to caution.

On the real data sets, ICAc, Gibbs, and LBP likewise performed similarly. However, there are
two kinds of differences that should be noted. First, there were a few data sets on which LBP and/or
Gibbs performed noticeably worse than ICAc. In particular, Gibbs has poor performance on HepTH
and WebKB+co. In both cases, this is likely due to issues of high link density (WebKB has very
many co-citation links; HepTH has fewer links but some nodes have very high degree). High link
density can lead to extreme probabilities, where Gibbs is known to perform poorly. While this
was not a particular problem with the synthetic data (perhaps because the training and test graphs
were more similar), NB is well known for producing polarized probabilities in some cases. PLUL
does help, for instance, improving performance on HepTH and WebKB+co by an average of 4%
and 15%, respectively, in Figures 10 and 11. Nonetheless, performance with Gibbs lags that of
ICAc or ICA, which are not so influenced by extreme probabilities. We experimented with more
and/or longer Gibbs chains but this did not improve performance. However, this is one case where
the LR classifier performed better than NB: it appears to produce less polarized probabilities than
NB, leading to improved performance with Gibbs (see Figures 24 and 27 in the online appendix).
Similarly, LBP, which struggles with high link density, also has problems with HepTH (and likely
would have low performance with WebKB+co, had it ever converged) and with Cora. Its difficulty
with Cora is surprising and possibly indicates that the conjugate gradient training did not perform
adequately, despite our attempts (cf. Sen et al., 2008). However, LBP did perform well on Citeseer,
which has similar characteristics.
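The way that high link density can polarize NB probabilities can be seen in a small worked example.
The sketch below makes several simplifying assumptions that are not taken from our experiments: two
classes, a uniform prior, neighbors treated as independent evidence, a symmetric per-neighbor
likelihood, and the illustrative value 0.6 for the probability that a neighbor shares the node's
class. The function name is hypothetical.

```python
def nb_posterior(p_match, n_neighbors, prior=0.5):
    """Posterior for class A when all n_neighbors are labeled A (two classes).

    p_match: assumed P(neighbor labeled A | node is A); by symmetry,
             P(neighbor labeled A | node is B) = 1 - p_match.
    Neighbors are treated as independent, as in a naive Bayes model.
    """
    like_a = prior * p_match ** n_neighbors
    like_b = (1.0 - prior) * (1.0 - p_match) ** n_neighbors
    return like_a / (like_a + like_b)

# Each neighbor is only weak evidence (0.6 vs. 0.4), yet the posterior
# moves rapidly toward 1 as the number of agreeing neighbors grows.
for k in (1, 5, 20):
    print(k, round(nb_posterior(0.6, k), 4))
```

With a single neighbor the posterior is simply 0.6, but with 20 agreeing neighbors it exceeds
0.999, even though each link is individually weak evidence. If some of those neighbor labels are
themselves uncertain estimates, such extreme probabilities overstate the available evidence.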

Second, in contrast to the small advantage for Gibbs on the synthetic data, for the real data ICAc
holds a small advantage. For instance, in Figure 10, ICAc outperforms Gibbs on average by 1% for
Cora and 2.4% for Citeseer, though not significantly. For HepTH and WebKB+co, where Gibbs had
trouble, the gains averaged 5.4% and 21.0%, respectively, and were significant for HepTH when
the number of attributes was small. ICAc was also robust: it was the only algorithm to outperform
ICA on average for every real data set considered. Moreover, using results pooled over all six data
sets, ICAc had moderate gains vs. ICA, Gibbs, LBP, and wvRNrl, both at the default number of
attributes (where the gains were significant) and using 100 attributes for each data set. Comparing
to just Gibbs and LBP, ICAc had a pooled gain of 4.9% and 7.8%, respectively, with the default
number of attributes, and 1.8% and 4.5%, respectively, with 100 attributes.




8.2  Cautious  Behavior  as  a  Predictor  of  Performance 

The previous section identified some of the situations in which the algorithms performed similarly
or differently. However, if we exclude the extreme data conditions such as very low attribute
predictiveness or high link density, a more remarkable finding emerges: the amount of cautious inference
used by an algorithm strongly predicts its relative performance. This finding is especially
interesting because the precise type of cautious inference seems to matter little. On both the synthetic
and the real data sets, in most cases ICAc, Gibbs, and LBP perform alike, while the non-cautious
ICA, GibbsNC, and LBPnc also perform similarly to each other (and at lower accuracy levels than
the cautious algorithms). However, when many test labels are known (high lp), the need for caution
decreases, and the differences between these two groups greatly diminish.

This effect can also be seen in other CC variants. For instance, wvRNrl and wvRNica+c perform
similarly, despite their very different approaches to caution, and they both outperform the
non-cautious wvRNica+nc. Likewise, in almost every case the somewhat-cautious ICAkn attained an
accuracy between that of the more cautious ICAc and the non-cautious ICA.

Thus, the amount of cautious inference seems to be the biggest factor differentiating those
algorithms that use attributes, much more so than whether some kind of ICA or Gibbs or LBP is used.
Likewise, when attributes are not used, as with the variants of wvRN, caution also appears to be the
largest factor in predicting relative performance.

8.3  Limitations  of  Cautious  Inference 

While our results show that the cautious use of relational information can significantly boost
performance, adding more caution to an algorithm is not always beneficial. In particular, the most extreme
form of relational caution is to not use any relational information (i.e., CO), but that is seldom
optimal. Instead, an algorithm must seek to cautiously avoid errors from noisy predictions while still
leveraging informative relations.

To illustrate these effects, Figure 16 shows accuracy results for three synthetic data conditions:
low attribute predictiveness (ap=0.3), the default settings, and high link density (ld=0.9). Here the
x-axis indicates the algorithm used, with the amount of relational caution used increasing to the right.
We focus on variants of ICA, but add three new algorithms for further analysis. ICA70 is just like
ICAc, except that it stops after it has “committed” and used the most certain 70% of the predicted
labels (i.e., after the iteration when h = 7 in Figure 2). ICA30 and ICA0 likewise stop after accepting
and using 30% and 0% of the predicted labels, respectively. Note that ICA0 is identical to ICAkn
during the very first iteration (when both use only the “known” labels for relational features), but
that ICA0 stops after that iteration, while ICAkn continues for 10 more iterations, using all available
predictions during those iterations.
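The relationship among ICAc, ICA70, ICA30, and ICA0 can be summarized as a commitment schedule
that is truncated at different points. The sketch below is illustrative only: it assumes n = 10
iterations and that iteration h commits the most confident h/n fraction of the predicted labels, as
suggested by the correspondence above between h = 7 and 70%; the function names and the toy
confidence values are hypothetical, and the local classifier itself is omitted.

```python
def commit_fractions(n_iters=10, stop_fraction=1.0):
    """Fraction of predicted labels committed at iterations h = 0..n_iters,
    truncated once stop_fraction has been reached (1.0 means no early stop)."""
    last_h = round(stop_fraction * n_iters)
    return [h / n_iters for h in range(last_h + 1)]

def most_confident(confidences, fraction):
    """Return the nodes whose predictions rank in the top `fraction` by confidence.

    confidences: dict node -> confidence of its current predicted label
    """
    k = int(round(fraction * len(confidences)))
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return set(ranked[:k])

print(commit_fractions(10, 1.0))   # full schedule, ending with all predictions in use
print(commit_fractions(10, 0.7))   # truncated after the most certain 70% are in use
print(commit_fractions(10, 0.0))   # a single iteration using only the known labels

toy_confidences = {"a": 0.90, "b": 0.60, "c": 0.80, "d": 0.55}
print(most_confident(toy_confidences, 0.5))   # the two most confident nodes, a and c
```

Under these assumptions, the variants differ only in where the schedule stops: the less certain
predictions that the full schedule would eventually use are never used by the truncated variants.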

For the default and low attribute predictiveness data conditions, the trends are very similar:
amongst ICA, ICAkn, and ICAc, the most cautious ICAc performs best. Adding more caution to
ICAc, however, consistently decreases performance, as ICA70, ICA30, and ICA0 use less and less
relational information, until the lowest performance is found with the non-relational CO. These
results make sense: for this data, relational links are informative, so completely ignoring any (or
all) of them is non-optimal. Indeed, using all of them without any caution (ICA) is much better than
cautiously ignoring all relations (CO), but the cautious algorithm that eventually uses all relations
(ICAc) performs best. Note that this property of (eventually) using all available relational
information is true of all of the more cautious algorithms that we considered in this article (ICAc,
Gibbs, LBP, wvRNrl, and wvRNica+c).

Figure 16: Accuracy as a function of the amount of relational caution used. ICA70, ICA30, and
ICA0 are (even more cautious) variants of ICAc that stop iterating before some of the
less certain relational information has been used. [The x-axis orders the algorithms from
less cautious (left) to more cautious (right).]

The high link density case provides an interesting contrast. Here the general shape of the curve
is similar, but the peak performance is observed with ICAkn, not with the more cautious ICAc. This
effect was already discussed in Section 7.5: if the baseline accuracy is high and there are many
links, simply using all available information after the first iteration is best. Similarly, for situations
where caution is not very important (e.g., when lp is high), the curve would show similar results for
ICA, ICAkn, and ICAc. Thus, in most cases being cautious with relational information is best, but
the algorithm should eventually use all available information (relational and non-relational), and in
some cases using more caution may be less important or even harmful.

8.4  Explanation  of  Prior  Results 

Our investigation enables us to answer the questions from Section 1, among others:

1. Why did Sen et al. (2008) find no consistent difference between Gibbs and ICA? In
contrast, Gibbs had worked well in other work, and in this article we found that Gibbs (and
ICAc) often significantly increases accuracy vs. ICA. However, our results and careful study
of Sen et al.’s methodology explain the discrepancy: to generate the test set, they used a
snowball sampling method that we found produces an effective labeled proportion (lp) of at
least 0.5, a region where the use of caution has little impact. Also, their study did not vary
attribute predictiveness, which we show is a significant factor in the relative performance of
more cautious CC algorithms.

2. Why did McDowell et al. (2007a) find that ICAc significantly outperforms Gibbs, even
though attribute predictiveness was high, while here we find that Gibbs performs on
par with or better than ICAc in such cases? To investigate, we re-ran our experiments from our
earlier paper, but with two variations informed by our now-refined understanding of CC. First,
we used PLUL with both the NB and kNN classifiers. Second, we changed the NB classifier
to use multiset relational features (instead of proportion), which use more information and
which Section 7.8 shows is the feature of choice when using NB (this change did not apply to kNN).
With these enhancements, Gibbs’s relative performance improved, so that ICAc and Gibbs
both significantly outperformed ICA, but the results for Gibbs and ICAc did not significantly
differ. Thus, more careful learning and representation choices resolve the discrepancy. This
also suggests that not using PLUL could potentially have an important effect on performance
comparisons. As an additional example, Sen and Getoor (2006) experimented with a wide
range of link densities but did not use a technique like PLUL; our results suggest that using
PLUL could have significantly improved their results with LBP for high ld.

3. Why did Galstyan and Cohen (2007) find that a soft-labeling version of wvRN fails to
consistently outperform a hard “label propagation” (LP) version? Most authors have expected
that, for relational-only classification, the soft-labeling algorithm that directly reasons
with probabilities (thus exercising cautious inference) should outperform a hard-labeling version
that only reasons with the single most likely label for each linked node. However, closer
examination of their LP algorithm reveals that it includes elements of caution. In particular,
after each iteration, LP labels a non-known node only if the estimated score for that node is
among the highest of any such nodes. Thus, in a way similar to wvRNica+c, nodes that are
closest to known nodes are labeled first, and the algorithm effectively favors label information
that was either known or is closer to other known nodes. This cautious behavior enables LP
to be competitive with (and sometimes outperform) the soft-labeling algorithm.

4.  Why  did  Sen  et  al.  (2008)  find  that  ICA  and  Gibbs  perform  better  with  LR  than  with  NB, 
while  we  find  the  reverse?  We  replicated  the  synthetic  data  of  their  paper,  and  reproduced 
their  results.  A  key  point,  however,  is  that  Sen  et  al.  used  count  relational  features  for  both 
NB  and  LR,  while  we  used  cross-validation  on  a  holdout  set  to  select  the  best  relational 
feature  type  (see  Section  6.6).  This  procedure  predominantly  selected  multiset  features  for 
NB  (see  Section  7.8),  which  we  found  in  separate  experiments  to  consistently  improve  NB 
performance  compared  to  using  count  features.  Consequently,  in  our  results  CC  algorithms 
that  use  NB  almost  always  outperformed  those  that  use  LR.  While  not  a  focus  of  our  work, 
such  differences  can  be  seen  in  Table  10.  The  superior  performance  of  multiset  features  also 
confirms the finding of Neville et al. (2003b).

5. When will cautious algorithms outperform their aggressive variants? We found that using
more cautious CC frequently and sometimes dramatically increased accuracy. In general,
cautious CC performs comparatively well whenever relational inference errors are more
likely. These errors occur more frequently when there is more uncertainty in the estimated
relational feature values (e.g., when the attribute predictiveness is low) or when the effect of
any such uncertainty is magnified (e.g., when autocorrelation is high). In some cases, such as
when the test set links to many known labels (high lp), using a more cautious CC algorithm
may be unnecessary. However, in many cases (and with most previous work) lp is small or
zero, and thus caution may be important.
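
Item 3 above contrasts soft and hard labeling for relational-only classification. The soft-labeling idea can be sketched as follows, assuming binary labels, unweighted links, and a plain synchronous averaging update. The name `soft_wvrn`, the helper `harden`, and the toy chain are illustrative assumptions; this is neither Galstyan and Cohen's procedure nor the exact wvRN variants evaluated in this article.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]


def soft_wvrn(graph: Graph, known: Dict[int, int],
              n_iter: int = 50) -> Dict[int, float]:
    """Simplified soft-labeling relational-neighbor update (illustrative only).

    Each unknown node's estimate of P(label = 1) is repeatedly replaced by
    the mean of its neighbors' current estimates; known labels stay fixed.
    """
    p = {v: float(known[v]) if v in known else 0.5 for v in graph}
    for _ in range(n_iter):
        updated = dict(p)
        for v, nbrs in graph.items():
            if v not in known and nbrs:
                updated[v] = sum(p[u] for u in nbrs) / len(nbrs)
        p = updated  # synchronous update: all nodes use the previous estimates
    return p


def harden(p: Dict[int, float]) -> Dict[int, int]:
    """Collapse probabilities to the single most likely label (ties go to 1)."""
    return {v: 1 if score >= 0.5 else 0 for v, score in p.items()}


# Toy example: chain 0-1-2-3-4 with known labels at both ends.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
probs = soft_wvrn(chain, known={0: 1, 4: 0})
hard = harden(probs)
```

On this chain the soft estimates fall off with distance from the node known to have label 1, while `harden` discards that gradation; a hard-labeling method that propagated only such collapsed labels would treat the uncertain middle node as if it were as reliable as a known one.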




9.  Conclusion 

Collective classification’s greatest strength (making inferences based on the inferred labels of related
nodes) can also be a significant weakness, since this use of uncertain labels may reduce accuracy
when the estimates are incorrect. In this article, we demonstrated that managing this estimation
uncertainty through “cautious” algorithmic behavior is essential to achieving maximal, robust
performance.  We  showed  how  varying  degrees  of  cautious  inference  could  be  manifested  in  four 
different  collective  inference  families,  and  explained  how  to  use  cautious  learning  with  PLUL  to 
further  improve  performance.  Our  experimental  results  with  both  synthetic  and  real-world  data  sets 
showed  that  cautious  algorithms  did  outperform  their  non-cautious  variants.  By  exploring  a  wide 
range  of  data,  we  identified  some  data  characteristics  for  which  this  performance  advantage  grew 
larger.  In  particular,  cautious  behavior  is  especially  important  when  there  is  a  higher  probability  of 
incorrect  relational  inference — which  occurs  when  autocorrelation  is  higher,  when  link  density  is 
moderate,  and/or  when  attribute  predictiveness  or  the  labeled  proportion  is  lower.  In  addition,  our 
study  enabled  us  to  answer  several  important  questions  from  previous  work. 

Across  a  wide  range  of  data,  we  found  that  an  algorithm’s  degree  of  caution  was  a  significant 
predictor  of  relative  performance — in  most  cases  a  more  important  one  than  the  specific  collective 
inference  algorithm  used.  This  reinforces  the  fundamental  importance  of  cautious  behavior  for  CC. 
However,  the  cautious  CC  algorithms  were  not  always  comparable.  Gibbs  and  (especially)  LBP 
sometimes  struggled  (e.g.,  when  the  data  had  high  link  density).  In  contrast,  ICAc  was  a  very 
reliable  performer  and  almost  always  had  maximal  or  near-maximal  performance,  especially  for  the 
real-world  data.  This  finding  is  interesting  because  this  article  is  the  first  to  consider  ICAc  in  depth. 
Moreover,  ICAc  is  a  simple  modification  to  ICA,  making  it  much  more  time-efficient  than  Gibbs  or 
LBP.  This  suggests  that  ICAc  is  a  strong  contender  for  general  CC  tasks,  and  should  be  used  as  a 
baseline  for  future  CC  performance  comparisons. 

Regarding  cautious  learning,  we  found  that  PLUL  generally  increased  accuracy,  sometimes 
substantially.  Parameter  tuning  is  known  to  be  important  for  learning  non-relational  classifiers. 
We  show  that  it  can  be  especially  critical  for  CC  due  to  CC’s  reliance  on  uncertain  labels  during 
testing. For example, further results showed that for the synthetic data when link density was high,
Gibbs+NB with a naive α (prior hyperparameter) of 1.0 attained 99% of the accuracy attainable
with any α, provided that most test labels were known (e.g., lp=W%). However, when lp=0% this strategy’s
accuracy was just 61% of optimal. Using PLUL to set α instead increased accuracy. In addition, our
results  in  Section  7.7  showed  PLUL  helping  both  cautious  and  non-cautious  inference  algorithms. 
Thus,  using  PLUL  for  cautious  learning  improves  performance,  and  adding  cautious  inference  helps 
even  more. 

Future  work  is  needed  to  compare  the  algorithms  considered  here  with  alternative  methods, 
such  as  Markov  Logic  Networks  (Richardson  and  Domingos,  2006)  and  the  “ghost  edge”  approach 
of  Gallagher  et  al.  (2008),  and  to  compare  PLUL  to  the  alternative  “stacked  models”  discussed  in 
Section  5.5.  In  addition,  further  studies  to  consider  the  effect  of  training  set  size,  noise  in  the  known 
labels,  and  link  uncertainty  would  be  useful.  Finally,  techniques  are  needed  to  further  improve  the 
performance  of  cautious  inference  on  data  with  high  link  density  or  other  extreme  conditions. 




Acknowledgments 

Thanks to Doug Downey, Lise Getoor, David Jensen, and Sofus Macskassy for helpful comments
on this work, to Prithviraj Sen for the Cora and Citeseer data sets, to Jennifer Neville for helpful
discussions and for code that implements LBP, and to Prithviraj Sen and Mustafa Bilgic for clarifications
on their work. Thanks also to the anonymous reviewers for many helpful comments that
helped to improve this article. Luke McDowell’s funding for this research was partly supported
by the U.S. Naval Academy Cooperative Program for Scientific Interchange, which is a component
of NRL’s General Laboratory Scientific Interchange Program. Portions of this analysis were
conducted using Proximity, an open-source software environment developed by the Knowledge Discovery
Laboratory at the University of Massachusetts Amherst (http://kdl.cs.umass.edu/proximity/).
The HepTH data was derived from the Proximity HEP-Th database, which is based on data from the
arXiv archive and the Stanford Linear Accelerator Center SPIRES-HEP database provided for the
2003 KDD Cup competition, with additional preparation performed by the Knowledge Discovery
Lab.

Appendix  A.  Measuring  the  Strength  of  Relational  Dependence 

Data sets used for CC are often measured for their autocorrelation. Alternatively, label consistency
is the percentage of links connecting nodes with the same label. A closely related measure is the
degree of homophily (dh) used by Sen et al. (2008). To see the difference, suppose that a data set
has five labels that occur with equal frequency. Sen et al. argue that, if dh is zero, the target of a link
from a node labeled A should be to another node labeled A 20% of the time (random chance), not
0% of the time (Sen, 2008). Thus, for a uniform class distribution, the actual probability of a link
connecting two nodes i and j of the same label is defined as:

$$\text{label consistency} = P(y_i = y_j \mid (i,j) \in E) = \mathit{dh} + \frac{1 - \mathit{dh}}{|C|} \qquad (4)$$

where |C| is the number of distinct class labels (five in the example above).

To facilitate comparison, we adopt this definition to generate synthetic data with varying levels
of dh. However, for real data sets, we can only directly compute label consistency. Thus, we
also compute approximate homophily from the measured label consistency by
assuming a uniform distribution of labels and solving for dh using Equation 4.
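
As a worked illustration of this conversion, the short Python sketch below maps dh to label consistency and back. It assumes that the second term of Equation 4 is (1 - dh) divided by the number of class labels, which is the reading consistent with the five-label example (20% consistency when dh is zero). The function names and the sample values are hypothetical.

```python
def label_consistency(dh: float, n_classes: int) -> float:
    """Probability that a link joins two same-label nodes, assuming
    uniformly distributed labels (assumed reading of Equation 4)."""
    return dh + (1.0 - dh) / n_classes


def approx_homophily(consistency: float, n_classes: int) -> float:
    """Solve the relation above for dh, given a measured label consistency."""
    chance = 1.0 / n_classes  # consistency expected from random linking
    return (consistency - chance) / (1.0 - chance)


# Five equally frequent labels: dh = 0 corresponds to chance-level consistency.
chance_level = label_consistency(0.0, 5)
# Round trip with an arbitrary illustrative homophily value.
recovered = approx_homophily(label_consistency(0.7, 5), 5)
```

Under this reading, a measured label consistency equal to 1/|C| maps to an approximate homophily of zero, and a consistency of 1 maps to a homophily of 1.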


Appendix  B.  Information  on  Additional  Results 

In Section 7, we omitted some results for alternate local classifiers (LR and kNN) and/or alternate
settings of lp, since they did not noticeably change our reported trends. These results are available
in an online appendix that accompanies this article on the JMLR website.


References 

Regina Barzilay and Mirella Lapata. Collective content selection for concept-to-text generation.
In  Proceedings  of  the  Human  Language  Technology  Conference  and  Conference  on  Empirical 
Methods  in  Natural  Language  Processing  (HLT/EMNLP),  pages  331-338,  2005. 




Julian  Besag.  On  the  statistical  analysis  of  dirty  pictures.  Journal  of  the  Royal  Statistical  Society, 
48(3):259-302,  1986. 

Julian  Besag.  Spatial  interaction  and  the  statistical  analysis  of  lattice  systems.  Journal  of  the  Royal 
Statistical  Society,  36(2):192-236,  1974. 

Mustafa Bilgic and Lise Getoor. Effective label acquisition for collective classification. In Proceedings
of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 43-51, 2008.

Béla Bollobás, Christian Borgs, Jennifer Chayes, and Oliver Riordan. Directed scale-free graphs. In
Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
132-139, 2003.

Soumen Chakrabarti, Byron Dom, and Piotr Indyk. Enhanced hypertext categorization using hyperlinks.
In Proceedings of the ACM SIGMOD International Conference on Management of Data,
pages 307-318, 1998.

Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew K. McCallum, Tom M. Mitchell, Kamal
Nigam, and Sean Slattery. Learning to extract symbolic knowledge from the World Wide Web.
In Proceedings of the 15th Conference of the American Association for Artificial Intelligence
(AAAI), pages 509-516, 1998.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.

Roland L. Dobrushin. The description of a random field by means of conditional probabilities and
conditions of its regularity. Theory of Probability and its Applications, 13(2):197-224, 1968.

Andrew Fast and David Jensen. Why stacked models perform effective collective classification. In
Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008.

Karen Yuen Fung and Barbara A. Wrobel. The treatment of missing values in logistic regression.
Biometrical Journal, 31(1):35-47, 1989.

Brian Gallagher and Tina Eliassi-Rad. Leveraging label-independent features for classification in
sparsely labeled networks: An empirical study. In Proceedings of the 2nd Workshop on Social
Network Mining and Analysis at the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2008.

Brian Gallagher, Hanghang Tong, Tina Eliassi-Rad, and Christos Faloutsos. Using ghost edges for
classification in sparsely labeled networks. In Proceedings of the 14th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 256-264, 2008.

Aram Galstyan and Paul R. Cohen. Empirical comparison of “hard” and “soft” label propagation
for relational classification. In Proceedings of the 17th International Conference on Inductive
Logic Programming (ILP), pages 98-111, 2007.

Tayfun Gurel and Kristian Kersting. On the trade-off between iterative classification and collective
classification: first experimental results. In Working Notes of the 3rd International ECML/PKDD
Workshop on Mining Graphs, Trees, and Sequences, 2005.




David  Heckerman.  A  tutorial  on  learning  with  bayesian  networks.  In  M.  Jordan,  editor,  Learning 
in  Graphical  Models.  MIT  Press,  1999. 

Andreas Heß and Nicholas Kushmerick. Iterative ensemble classification for relational data: A case
study  of  semantic  web  services.  In  Proceedings  of  the  15th  European  Conference  on  Machine 
Learning  (ECML),  pages  156-167,  2004. 

Cecil  Huang  and  Adnan  Darwiche.  Inference  in  belief  networks:  A  procedural  guide.  International 
Journal  of  Approximate  Reasoning,  15(3):225-263,  1996. 

David  Jensen  and  Jennifer  Neville.  Linkage  and  autocorrelation  cause  feature  selection  bias  in 
relational  learning.  In  Proceedings  of  the  19th  International  Conference  on  Machine  Learning 
(ICML),  pages  259-266,  2002. 

David  Jensen,  Jennifer  Neville,  and  Michael  Hay.  Avoiding  bias  when  aggregating  relational  data 
with  degree  disparity.  In  Proceedings  of  the  20th  International  Conference  on  Machine  Learning 
(ICML),  pages  274-281,  2003. 

David  Jensen,  Jennifer  Neville,  and  Brian  Gallagher.  Why  collective  inference  improves  relational 
classification.  In  Proceedings  of  the  10th  ACM  SIGKDD  International  Conference  on  Knowledge 
Discovery  and  Data  Mining,  pages  593-598,  2004. 

Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97
(1-2):273-324, 1997.

Daphne Koller, Nir Friedman, Lise Getoor, and Benjamin Taskar. Graphical models in a nutshell. In
L. Getoor and B. Taskar, editors, An Introduction to Statistical Relational Learning. MIT Press,
2007. 

Zhenzhen  Kou  and  William  W.  Cohen.  Stacked  graphical  models  for  efficient  inference  in  Markov 
Random  Fields.  In  Proceedings  of  the  7th  SIAM  International  Conference  on  Data  Mining 
(SDM),  pages  533-538,  2007. 

Qing  Lu  and  Lise  Getoor.  Link-based  classification.  In  Proceedings  of  the  20th  International 
Conference  on  Machine  Learning  (ICML),  pages  496-503,  2003a. 

Qing Lu and Lise Getoor. Link-based classification using labeled and unlabeled data. In Proceedings
of the Workshop on the Continuum from Labeled to Unlabeled data at the 20th International
Conference  on  Machine  Learning  (ICML),  2003b. 

Sofus  A.  Macskassy.  Improving  learning  in  networked  data  by  combining  explicit  and  mined  links. 
In  Proceedings  of  the  22nd  AAAI  Conference  on  Artificial  Intelligence,  pages  590-595,  2007. 

Sofus  A.  Macskassy  and  Foster  Provost.  Suspicion  scoring  based  on  guilt-by-association,  collective 
inference, and focused data access. In Proceedings of the International Conference on Intelligence
Analysis, 2005.

Sofus  A.  Macskassy  and  Foster  Provost.  Classification  in  networked  data:  A  toolkit  and  a  univariate 
case  study.  Journal  of  Machine  Learning  Research,  8:935-983,  2007. 




Sofus A. Macskassy and Foster Provost. A brief survey of machine learning methods for classification
in networked data and an application to suspicion scoring. In Proceedings of the Workshop on
Statistical Network Analysis at the 23rd International Conference on Machine Learning (ICML),
2006.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy markov models
for information extraction and segmentation. In Proceedings of the 17th International Conference
on Machine Learning, pages 591-598, 2000a.

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction
of internet portals with machine learning. Information Retrieval, 3:127-163, 2000b.

Luke K. McDowell, Kalyan Moy Gupta, and David W. Aha. Cautious inference in collective classification.
In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pages 596-601,
2007a.

Luke K. McDowell, Kalyan Moy Gupta, and David W. Aha. Case-based collective classification. In
Proceedings of the 20th International Florida Artificial Intelligence Research Society Conference
(FLAIRS), pages 399-404, 2007b.

Robert J. McEliece, David J. C. MacKay, and Jung-Fu Cheng. Turbo decoding as an instance of
Pearl’s “belief propagation” algorithm. IEEE Journal on Selected Areas in Communications,
16(2):140-152, 1998.

Miller McPherson, Lynn Smith-Lovin, and James M. Cook. Birds of a feather: Homophily in social
networks. Annual Review of Sociology, 27:415-444, 2001.

Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate
inference: An empirical study. In Proceedings of the 15th Conference on Uncertainty in Artificial
Intelligence, pages 467-475, 1999.

Jennifer Neville and David Jensen. A bias/variance decomposition for models using collective
inference. Machine Learning Journal, 73(1):87-106, 2008.

Jennifer Neville and David Jensen. Iterative classification in relational data. In Proceedings of the
Workshop on Learning Statistical Models from Relational Data at the 17th National Conference
on Artificial Intelligence (AAAI), pages 13-20, 2000.

Jennifer Neville and David Jensen. Leveraging relational autocorrelation with latent group models.
In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), pages 170-177,
2005.

Jennifer Neville and David Jensen. Relational dependency networks. Journal of Machine Learning
Research, 8:653-692, 2007.

Jennifer Neville, David Jensen, Lisa Friedland, and Michael Hay. Learning relational probability
trees. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pages 625-630, 2003a.




Jennifer Neville, David Jensen, and Brian Gallagher. Simple estimators for relational bayesian
classifiers. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM),
pages 609-612, 2003b.

Jennifer Neville, Özgür Şimşek, David Jensen, John Komoroske, Kelly Palmer, and Henry G. Goldberg.
Using relational knowledge discovery to prevent securities fraud. In Proceedings of the
11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
449-458, 2005.

Mark E. Newman. Mixing patterns in networks. Physical Review E, 67(2):026126, 2003.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

Matthew J. Rattigan, Marc Maier, David Jensen, Bin Wu, Xin Pei, JianBin Tan, and Yi Wang.
Exploiting network structure for active inference in collective classification. In Proceedings of the
Workshop on Mining Graphs and Complex Structures at the 7th IEEE International Conference
on Data Mining (ICDM), pages 429-434, 2007.

Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):
107-136, 2006.

Maytal Saar-Tsechansky and Foster Provost. Handling missing values when applying classification
models. Journal of Machine Learning Research, 8(Jul):1623-1657, 2007.

Prithviraj Sen. Personal communication, 2008.

Prithviraj Sen and Lise Getoor. Empirical comparison of approximate inference algorithms for
networked data. In Proceedings of the Workshop on Open Problems in Statistical Relational
Learning at the 23rd International Conference on Machine Learning (ICML), 2006.

Prithviraj Sen and Lise Getoor. Link-based classification. Technical Report CS-TR-4858, University
of Maryland, College Park, MD, February 2007.

Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad.
Collective classification in network data. AI Magazine, Special Issue on AI and Networks, 29(3):
93-106, 2008.

Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational
data. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI),
pages 485-492, 2002.

YongHong Tian, Tiejun Huang, and Wen Gao. Latent linkage semantic kernels for collective classification
of link data. Journal of Intelligent Information Systems, 26(3):269-301, 2006.

Rudolph Triebel, Richard Schmidt, Oscar Martinez Mozos, and Wolfram Burgard. Instance-based
AMN classification for improved object recognition in 2D and 3D laser range data. In Proceedings
of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2225-2230,
2007.

Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. Advances
in Neural Information Processing Systems (NIPS), 13:689-695, 2000.




Nevin Lianwen Zhang and David Poole. Exploiting causal independence in bayesian network inference.
Journal of Artificial Intelligence Research, 5:301-328, 1996.

Bin  Zhao,  Prithviraj  Sen,  and  Lise  Getoor.  Event  classification  and  relationship  labeling  in  affiliation 
networks.  In  Proceedings  of  the  Workshop  on  Statistical  Network  Analysis  (SNA)  at  the  23rd 
International  Conference  on  Machine  Learning  (ICML),  2006. 




